ABSTRACT


MODELS FOR EVALUATING THE PERFORMABILITY 
OF DEGRADABLE COMPUTING SYSTEMS 


by 

Liang Tai Wu 


Chairman: John F. Meyer 

Recent advances in multiprocessor technology have established the 
need for unified methods to evaluate computing systems performance and 
reliability. In response to this modeling need, this dissertation considers a 
general modeling framework that permits the modeling, analysis and evaluation 
of degradable computing systems. Within this framework, several
user-oriented performance variables are identified and shown to be proper
generalizations of the traditional notions of system performance and reliability. 
Furthermore, a time-varying version of the model is developed to generalize 
the traditional fault-tree reliability evaluation methods of phased missions. 

The modeling and evaluation methods considered in this dissertation 
provide a relatively straightforward approach to integrate reliability and 
availability measures with performance measures. The hierarchical 
decomposition approach permits the modeling and evaluation of a computing 
system’s subsystems (e.g., hardware, software, peripherals, interfaces, user 
demand systems) as a whole rather than the traditional methods of evaluating 
these subsystems independently. Accordingly, it becomes possible to evaluate 
the performance of the system software and the reliability of the system
hardware simultaneously in order to measure the effectiveness of the system 
design. Moreover, since the performance variables considered in this study 
permit the characterization of system performance according to the application 
needs of a system, the results obtained represent more accurate assessments of 
the system’s ability to perform than the existing performance or reliability
measures.


CHAPTER 1


INTRODUCTION 


1.1 Background 

The recent developments in multiprocessor systems have stimulated 
a growing interest in degradable computing systems that are designed to 
provide a high degree of performance and reliability by reallocating the 
computer’s resources when faults are detected. To assess the effectiveness of 
these computing systems, it has been found that the traditional way of 
evaluating the performance and the reliability as distinct attributes of a 
computer is no longer adequate [1], [2]. Traditional performance evaluation
methods generally assume that the computer to be evaluated is fault free and
are concerned with the quantification of the effectiveness with which the
computer’s resources handle a specific application (see [3] and [4], for
example). Traditional reliability evaluation methods, on the other hand, deal
with the measurement of a computer’s ability to remain operational in the
event of physical failures (see [5] through [8]). Since the level of performance
of a degradable computing system may decrease with successive failures, the
performance and the reliability of the system must be dealt with simultaneously 
to measure the extent to which the user can benefit from the tasks 
accomplished by the computer. 

In response to the above modeling need of degradable computing
systems, some recent investigations have attempted to formulate new modeling
and evaluation methods that combine both the performance and the reliability 
characteristics of computing systems. Particularly, Beaudry [9] has considered 
performance-reliability measures that reflect the computational capacity of a 
system, defined as the amount of useful computations available per unit of 
time, and has shown that these measures can be evaluated in terms of a 
transformed Markov process. By examining the set of jobs executed by a
computing system, Mine and Hatayama [10] have considered the reliability of
the system with respect to a specific job, called job-related reliability. Although 
the above models have shown the feasibility of combining the performance and 
reliability measures into a single measure, their efforts have focused mainly on 
the maximum capacity at which a computer can handle its computation. The 
effect of interactions between the demand for computation (by the user) and its 
supply (by the computer) has not been considered explicitly. 

Another approach to quantifying the unified performance and 
reliability of computing systems is based on Markov reward processes [11]. By
assigning a throughput rate to each state of a Markov process that describes the
resource availability of a computing system, Gay [12] has considered the
expected system throughput and the throughput availability of a system. A 
similar model has also been used in [13] by De Souza to estimate the reduction 
in operating cost when fault-tolerance features are incorporated in commercial 
systems. More recently, based on renewal process models, Castillo and 
Siewiorek [14] have considered the apparent capacity and expected elapsed time
required to execute a program correctly. 

In contrast to the above efforts to formulate specific performance 
measures for degradable computing systems, Meyer [1] has developed a general
modeling framework that permits the definition, formulation and evaluation of
user-oriented performance measures. A hierarchical model is defined [1] which
assumes that the probabilistic nature of the total system S (the computer and
its environment) is modeled by a stochastic process $X_S$. It is further assumed
that the process $X_S$ can be used to determine the probability distribution
function of a random variable $Y_S$ which describes the user’s view of how well
the system performs. The probability distribution function of $Y_S$ is shown to
induce a useful performance measure, referred to as the performability of S, in 
the context of degradable computing system performance. 

1.2 Research Objectives 

In this investigation, among other things, we wish to extend the 
modeling framework in [1] to provide a more concrete basis for studying the 
evaluation of degradable computing systems. By introducing extra ingredients 
to the modeling framework, we wish to develop a general stochastic process 
model of degradable computing systems that satisfies the following objectives: 

(1) The model should be general enough to permit uniform formulation of
different performance measures.

(2) The model should be specific enough to permit derivations of
computational algorithms and formulas.

(3) The model should be flexible enough to be related to traditional 
performance and reliability models so that it may serve as a basis for 
unifying traditional computing system evaluation methods. 

(4) The model should be able to reflect the information processing needs of 
the user as well as internal structural changes of the system caused by 
component failures. 

In addition to the above efforts of model development, we also wish 
to apply the results obtained to evaluate a large class of fault-tolerant 
computing systems known as "degradable multiprocessor systems." By
comparing the effectiveness of various design strategies, we wish to illustrate 
the tradeoffs between different techniques of incorporating fault-tolerance in 
the design of a multiprocessor system. 

Chapter 2 puts this work in context with respect to the general 
modeling framework considered in [1]. It describes the components of a 
performability model and formalizes the relationships among these components. 
The major results of this chapter include the precise formulation of the notion 
of system performance in a broad context and the clarification of the notion of 
supporting the evaluation of system performance using a stochastic process
model.

Chapter 3 introduces a general notion of recoverability and
establishes necessary and sufficient conditions for an operational model to be 
recoverable. For both the recoverable and the nonrecoverable models, it 
examines the solution methods of a generally defined performance variable 
where the performance is identified with the minimum value of a functional.
The modeling approach and the evaluation methods are then illustrated through 
the evaluations of a multiprocessor system. The results obtained indicate that 
the performance variable is, indeed, a proper generalization of the traditional 
notions of the system performance and reliability. The modeling and the 
evaluation methods proposed thus represent a unifying approach for integrating 
the performance and the reliability measures of computing systems. 

Chapter 4 presents a specific operational model for evaluating the 
performability of degradable multiprocessor systems. The model is constructed 
according to a hierarchical decomposition of a system's behavior. A Markovian
base model is developed to represent the resource availability of the system, 
and priority queueing models are used to determine the operational rates of the 
resource states. The model not only demonstrates the generality of an 
operational model but also illustrates the feasibility of modeling and evaluating 
the system performance via a step-by-step hierarchical approach. The methods 
developed in this chapter thus represent a straightforward approach to produce 
a composite picture of a computer’s ability to meet overall throughput goals. 


Chapter 5 extends the concept of an operational model to phased
missions where the environment of a system can vary in time. Both the
combinatorial and the probabilistic properties of the extended model are 
examined in detail. In addition, an example is constructed to illustrate the 
performability evaluation of a phased mission with multiple accomplishment
levels. The results obtained in this chapter represent an important step toward 
the understanding and the development of a more general time-varying 
operational model. 

Chapter 6 summarizes the results of this study and suggests topics
for further research.



CHAPTER 2 


PERFORMABILITY EVALUATION OF COMPUTING SYSTEMS 

2.1 Introduction

The concept of hierarchical organization has become an important 
tool in the design of computing systems. Hardware components are typically
formed by putting together some basic modules or building blocks, and 
software components are often structured into subroutines in a top-down 
manner. By carefully organizing the structure of a computer into a hierarchy of 
components, it becomes possible to increase greatly the capability and the 
functional features of the computer. Although this concept of hierarchical 
organization has been used extensively in the design of computing systems 
since the invention of the first electronic computer, its implication in computing 
systems performance evaluation has only been exploited recently. 

As suggested in [2], a computing system can be described by a 
hierarchy of system models that vary in "scope" and "level of abstraction" (see
Figure 2.1 for an example of what we call a model hierarchy). In this
representation, a higher level model has a larger scope and a higher level of
abstraction, i.e., it describes a larger portion of the computer and its
environment, but possibly in less detail. In particular, the top model
has a scope that includes all the subsystems that can influence the
computational process of the system (e.g., hardware and software, peripherals,
interfaces, maintenance systems, user demand system, etc., collectively referred
to as the total system). The level of abstraction at the top level is expressed in a 
form easily usable by the system user. On the other hand, the bottom model 
may involve low level representations of the computer’s hardware and 
operating system structure. 

Based on the above model hierarchy of the total system, various 
performance and reliability measures can then be associated with models at 
each level of the hierarchy. The part of the total system that one is interested 
in evaluating must be identified first with a specific level in the hierarchy 
(referred to as the object system). The part of the total system outside the 
object system is then regarded as the environment of the object system. The 
choice of an object system is, to a large extent, determined by the particular 
problem one is interested in solving. For example, if the performance or 
reliability of a data-base system is to be evaluated, the object system will not 
only include the hardware and operating system but also the data bases and 
their supporting programs. 

Once a specific object system is selected, the performance of the 
system can then be defined as how well the object system satisfies the 
computational demands (also referred to as the workload) imposed by its 
environment [4]. The performance is typically considered as a random variable 
referred to as a performance variable or performance index. Thus we can talk
about the mean, variance, distribution function, and the like of a performance 
variable. 

If we regard computer performance measures to be the 
measurements of the quality of a computer according to the above broad 
definition, some observations about the relationships among different 
performance measures can now be made. First, we note that there are no 
essential differences between what are traditionally called performance 
measures (e.g., throughput rate, response time or utilization rate) and what are 
traditionally called reliability measures (e.g., reliability, availability or 
maintainability). They differ only in the way performance criteria are formed. 
Any performance or reliability measures, viewed in the broadest context, must 
account for both the workload and the probabilistic nature of the object system. 
Accordingly, in the discussion that follows, the term performance measure will 
be used to include both the performance and the reliability aspects of a system. 
Second, we also note that different performance measures can be associated 
with different levels of the model hierarchy. Thus, for each level of the 
hierarchy, it is possible to formulate various performance measures according 
to the application needs as well as the modeling requirements of the object 
system. 

In the following section, we first define more precisely the basic 
elements of a performance study by considering the basic components of a 
performability model. The relationships among those components are then
formalized via the concept of capability functions. Finally, the notion of 
supporting the evaluation of system performance using a stochastic process is 
made precise by relating the probabilistic nature of the performance variable to 
the known properties of the underlying stochastic process. 

2.2 System Models 

A major objective of this section is to formulate precisely what we 
mean by a "stochastic model" for system performability. It is assumed that the 
total system S = (C,E) contains a computer C operating in an environment E.
The computer C is composed of several processors, memory modules, 
input/output devices, buses, etc., and the environment E includes man-made 
components (e.g., interface circuits and peripheral subsystems), operational 
rules (e.g., job submitting policies and maintenance procedures) and other 
conditions (e.g., weather) that can influence the computer’s effectiveness. At 
this level of abstraction, it is appropriate to view S as a network of 
interconnected subsystems with simultaneous information flow among 
subsystems. Accordingly, S can be described as an autonomous state transition 
system that changes state due to events occurring in time. 

Given the above characterization of the total system, the behavior 
of S can be viewed as a stochastic process 

    $X_S = \{X_t \mid t \in T\}$    (2.1)

where T is the time range involved (called the utilization period) and, for each
$t \in T$,

    $X_t: \Omega \to Q$

is a random variable defined on a common probability space $(\Omega, \mathcal{E}, P)$ and
taking values in the state space Q of the total system. In the following
discussions, it will be assumed that T is a set of real numbers and Q is a
discrete set. Thus, without loss of generality, the states of $X_S$ will often be
named by positive integers, viz.

    $Q = \{1, 2, 3, \ldots\}$

or, when Q is finite,

    $Q = \{1, 2, \ldots, n\}$.

The stochastic process $X_S$ will be referred to as the base model of S and is
denoted simply as X when the system context is clear.

Although the base model X provides a detailed description of the 
system's state behavior, the description is generally invisible to the users. It is
assumed that the users are concerned only with distinguishing different "levels
of accomplishment" when judging how well the system has performed.
Accordingly, the user’s view of the system’s behavior can be formulated as a
random variable with respect to the underlying probability space $(\Omega, \mathcal{E}, P)$, i.e.,

    $Y: \Omega \to A$    (2.2)

where, for each $\omega \in \Omega$, $Y(\omega)$ takes a value in the accomplishment set A.
Depending on the application, the accomplishment set A can be any set of real 
numbers where the elements of A are taken to be the various degrees of user 
satisfaction such that a>b if a is preferred over b (i.e., the ordering relation > 
as implied by the user preference coincides with the natural ordering of real 
numbers). For example, to evaluate the reliability of a nondegradable system, 
the accomplishment set can be taken to be $A = \{0, 1\}$ where 1 = "system
success" and 0 = "system failure." On the other hand, if the user is interested
in evaluating the system throughput, A can be taken to be an interval of real 
numbers. The random variable Y will be referred to as a performance variable 
of S. 

As generally defined above, the performance variable Y clearly can 
be used to characterize either the performance or the reliability aspects of a 
system. Thus a natural measure that can be used in the evaluation of 
computing systems is the probability measure induced by Y (e.g., see [13] 
p.97). This unified performance-reliability measure is referred to as the 
performability of S which, in terms of our modeling framework, can be defined 
as the function $\mathrm{perf}_S$ where, for each measurable set $B \subseteq A$ (i.e.,
$\{\omega \mid Y(\omega) \in B\} \in \mathcal{E}$) of accomplishment levels,

    $\mathrm{perf}_S(B) = P(\{\omega \mid Y(\omega) \in B\})$,    (2.3)

i.e., $\mathrm{perf}_S(B)$ is the probability that S performs at a level in B. The requirement
that B be measurable ensures the existence of the probability on the right side
of (2.3).
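
As an illustration of (2.3), the following minimal Python sketch evaluates the performability on a small finite sample space. The three outcomes, their probabilities, and the accomplishment levels are hypothetical values chosen only for this example; they are not taken from the systems evaluated in this study.

```python
from fractions import Fraction

# Hypothetical finite sample space (assumption): outcome -> probability P({w}).
P = {"w1": Fraction(1, 2), "w2": Fraction(3, 10), "w3": Fraction(1, 5)}

# Hypothetical performance variable Y: Omega -> A, with A = {0, 1/2, 1}.
Y = {"w1": Fraction(1), "w2": Fraction(1, 2), "w3": Fraction(0)}


def perf(B):
    """perf_S(B) = P({w | Y(w) in B}) for a set B of accomplishment levels."""
    return sum((p for w, p in P.items() if Y[w] in B), Fraction(0))


full_performance = perf({Fraction(1)})
degraded_or_better = perf({Fraction(1, 2), Fraction(1)})
```

When the accomplishment set is restricted to {0, 1}, the value perf({1}) computed this way coincides with the usual notion of reliability.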

In theory, it is possible to determine the performability of S from 
the underlying probability space $(\Omega, \mathcal{E}, P)$ of the performance variable Y.
However, in practice, the underlying probability space is generally unknown 
and, consequently, the performability must be determined from known 
properties of the base model X. Hence an important step to determine the 
performability of S is to establish relations between the base model X and the 
performance variable Y based on the given properties of X. 

Following a common practice in probability theory, we assume that 
the base model X is specified by its finite-dimensional distributions or by 
information that determines these distributions (e.g., Markov assumptions 
together with a transition function and an initial distribution). Based on these 
finite-dimensional distributions, we then construct a "coordinate probability
space" (see [16] or [17] for the details of this construction) and express the
performance variable Y in terms of an equivalent random variable defined on 
the new probability space. The advantage of this approach is that the resulting 
probability space is considerably structured and, hence, questions about the
probabilistic nature of X and Y can be addressed more easily by dealing with 
the coordinate probability space. The construction of this probability space 
described in [16] is summarized as follows. 

Suppose that the base model X is described by a family of finite-dimensional
distributions

    $\Phi = \{F_{t_1,\ldots,t_n} \mid t_1,\ldots,t_n \in T \text{ and } n \in N\}$    (2.4)

where T is the utilization period and N is the set of all natural numbers. Then
the coordinate probability space is a probability space $(U, \mathcal{F}, \mathrm{Pr})$ where


1. The coordinate sample space U is the set of all functions $u: T \to Q$ where
Q is the state space of X. In other words, U is the $|T|$-dimensional direct
product of the state space Q.

2. To construct the event space $\mathcal{F}$, let $\mathcal{B}^n$ be the smallest $\sigma$-algebra
generated by the relative topology of $Q^n$ (also referred to as the topology
of $Q^n$ induced by the n-dimensional Euclidean space). For each set B in
$\mathcal{B}^n$ and given $t_1,\ldots,t_n$ in T, let

    $C = \{u \in U \mid [u(t_1), \ldots, u(t_n)] \in B\}$,    (2.5)

i.e., C is the set of all functions u in U such that the values of u at $t_i$
($1 \le i \le n$), when regarded as an n-tuple, form an element of B. If we let $\mathcal{F}_0$
be the set of all C such that C is obtained from (2.5) for all $n \in N$, all
$B \in \mathcal{B}^n$, and all $t_1,\ldots,t_n \in T$, then $\mathcal{F}_0$ is a field. Finally, the event space $\mathcal{F}$ is
taken to be the completion of the smallest $\sigma$-algebra containing $\mathcal{F}_0$.

3. To construct the probability measure Pr, we first define a measure $\mu$ on
$\mathcal{F}_0$ such that for each $C \in \mathcal{F}_0$,

    $\mu(C) = \int \cdots \int_B dF_{t_1,\ldots,t_n}$    (2.6)

where C is generated by the set B with indices $t_1,\ldots,t_n$ (see (2.5)) and
$F_{t_1,\ldots,t_n}$ is a multivariate distribution in $\Phi$ (see (2.4)). Then the
probability measure Pr is the completed version of the above measure $\mu$.


Given the coordinate probability space $(U, \mathcal{F}, \mathrm{Pr})$ of X, we can
construct an equivalent process of X defined on $(U, \mathcal{F}, \mathrm{Pr})$ such that both
processes have the same multivariate distributions $\Phi$. More precisely, let us
define

    $\hat{X} = \{\hat{X}_t \mid t \in T\}$

where, for all $t \in T$ and all $u \in U$,

    $\hat{X}_t(u) = u(t)$.    (2.7)

Furthermore, for all B in $\mathcal{B}^n$ and all $t_1,\ldots,t_n$ in T, let us assign (see (2.6))

    $\int \cdots \int_B dF_{t_1,\ldots,t_n}$

to be the probability of the event

    $\{u \in U \mid [\hat{X}_{t_1}(u), \ldots, \hat{X}_{t_n}(u)] \in B\}$.

Then, for each $t \in T$, $\hat{X}_t$ is a function from the set U to the set Q and the family
of functions $\hat{X}$ is a stochastic process defined on $(U, \mathcal{F}, \mathrm{Pr})$ having the
multivariate distributions $\Phi$ (see [17], pp. 10-11). Since, for each $u \in U$, u(t)
specifies the state of $\hat{X}$ at t (see (2.7)), each element in U will be referred to as
a state trajectory and the set U will be referred to as the trajectory space.
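
The construction in (2.5) and (2.6) can be made concrete with a small sketch. The Python fragment below assumes, purely for illustration, that the base model is a discrete-time, time-homogeneous Markov chain with T = {0, 1, 2, ...} and three states; the transition probabilities and the helper names (`joint_prob`, `cylinder_prob`) are hypothetical. In this discrete setting the integral in (2.6) reduces to a sum of joint probabilities over the n-tuples in B.

```python
from fractions import Fraction

# Hypothetical base model (assumption): a discrete-time Markov chain on
# Q = {1, 2, 3}, where 1 = fully operational, 2 = degraded, 3 = failed.
STATES = (1, 2, 3)
INIT = [Fraction(1), Fraction(0), Fraction(0)]  # start in state 1
STEP = [
    [Fraction(9, 10), Fraction(1, 10), Fraction(0)],
    [Fraction(0), Fraction(4, 5), Fraction(1, 5)],
    [Fraction(0), Fraction(0), Fraction(1)],
]


def mat_mul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, k):
    """k-step transition matrix (identity when k == 0)."""
    n = len(m)
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def joint_prob(times, states):
    """P(X_{t_1} = q_1, ..., X_{t_n} = q_n) for strictly increasing times."""
    idx = [STATES.index(q) for q in states]
    first = mat_pow(STEP, times[0])
    p = sum(INIT[i] * first[i][idx[0]] for i in range(len(STATES)))
    for k in range(1, len(times)):
        p *= mat_pow(STEP, times[k] - times[k - 1])[idx[k - 1]][idx[k]]
    return p


def cylinder_prob(times, B):
    """Pr of C = {u | (u(t_1), ..., u(t_n)) in B}; discrete analogue of (2.6)."""
    return sum((joint_prob(times, q) for q in B), Fraction(0))
```

For instance, `cylinder_prob((1, 2), {(1, 1)})` is the probability of the set of all trajectories that are in state 1 at both t = 1 and t = 2, regardless of their values at other times.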

The notion of a coordinate probability space permits us to answer
questions about the probabilistic nature of X by relating the questions to the
state behavior of $\hat{X}$. In particular, for each fixed $\omega \in \Omega$ of the underlying
probability space $(\Omega, \mathcal{E}, P)$ let us define a function $u_\omega: T \to Q$ such that, for all
$t \in T$,

    $u_\omega(t) = X_t(\omega)$.    (2.8)

Moreover, given an event V in $\mathcal{F}$, let

    $W = \{\omega \mid u_\omega \in V\}$    (2.9)

be a subset of $\Omega$. Then, it is known (see [17], pp. 621-622) that W is an event
in $\mathcal{E}$ and that

    $P(W) = \mathrm{Pr}[V]$.    (2.10)

On the other hand, given a subset W of $\Omega$ measurable with respect to the
induced probability space of X, there exists an event V in $\mathcal{F}$ such that (2.10) is
satisfied.

In the following discussions, we consider the question of what we
need to know about X in order to determine the distribution function of the
performance variable Y. Formally, we say that X supports Y if there exists a
random variable

    $\gamma: U \to A$    (2.11)

defined with respect to the coordinate probability space $(U, \mathcal{F}, \mathrm{Pr})$ such that for
each $\omega \in \Omega$

    $Y(\omega) = \gamma(u_\omega)$    (2.12)

where $u_\omega$ is the state trajectory associated with the outcome $\omega$. Since $\gamma$ can be
regarded as the user's performance criteria for judging the "capability" of the
total system, it is referred to as the capability function of S.

When X supports Y, the capability function permits us to determine
the performability of S using the finite-dimensional distributions of X. To
substantiate this claim, let us define a function

    $h: \Omega \to U$

such that, for all $\omega \in \Omega$, $h(\omega) = u_\omega$ where $u_\omega$ is the state trajectory associated with
the outcome $\omega$. Then, by (2.12), X supports Y implies

    $Y = \gamma \circ h$,

i.e., Y is the functional composition of $\gamma$ and h, applying h first. Accordingly,
taking the preimage on both sides, we have

    $Y^{-1} = h^{-1} \circ \gamma^{-1}$.

Hence, for any measurable set $B \subseteq A$, if we let

    $V = \{u \mid \gamma(u) \in B\} \subseteq U$

and

    $W = \{\omega \mid Y(\omega) \in B\} \subseteq \Omega$,

then

    $W = \{\omega \mid u_\omega \in V\} = h^{-1}(V)$.

Accordingly, by (2.10), $P(W) = \mathrm{Pr}[V]$, which, in turn, implies

    $\mathrm{perf}_S(B) = P(\{\omega \mid Y(\omega) \in B\}) = \mathrm{Pr}[\gamma^{-1}(B)]$.    (2.13)

Since the probability $\mathrm{Pr}[\gamma^{-1}(B)]$ can be determined directly from the
finite-dimensional distributions of X, we have shown that X together with $\gamma$ suffice to
support an evaluation of the performability $\mathrm{perf}_S$.
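
Equation (2.13) can be sketched numerically by enumerating trajectories. The Python fragment below is a brute-force illustration under stated assumptions, not a method proposed in this study: the base model is taken to be a hypothetical three-state discrete-time Markov chain observed over T = {0, 1, 2}, and the capability function `gamma` (the minimum of a hypothetical per-state operational rate along the trajectory) is chosen only as an example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical base model (assumption): states 1 = fully operational,
# 2 = degraded, 3 = failed; one-step transition probabilities STEP[a][b].
STATES = (1, 2, 3)
INIT = {1: Fraction(1), 2: Fraction(0), 3: Fraction(0)}
STEP = {
    1: {1: Fraction(9, 10), 2: Fraction(1, 10), 3: Fraction(0)},
    2: {1: Fraction(0), 2: Fraction(4, 5), 3: Fraction(1, 5)},
    3: {1: Fraction(0), 2: Fraction(0), 3: Fraction(1)},
}
HORIZON = 2  # utilization period T = {0, 1, 2}

# Hypothetical operational rate attached to each state (assumption).
RATE = {1: Fraction(1), 2: Fraction(1, 2), 3: Fraction(0)}


def trajectory_prob(u):
    """Pr of the single trajectory u = (u(0), ..., u(HORIZON))."""
    p = INIT[u[0]]
    for a, b in zip(u, u[1:]):
        p *= STEP[a][b]
    return p


def gamma(u):
    """Example capability function: worst operational rate along u."""
    return min(RATE[q] for q in u)


def perf(B):
    """perf_S(B) = Pr[gamma^{-1}(B)], summing over the trajectory set of B."""
    return sum(
        (trajectory_prob(u)
         for u in product(STATES, repeat=HORIZON + 1)
         if gamma(u) in B),
        Fraction(0),
    )
```

Enumeration grows exponentially with the length of the utilization period, which is one concrete reason the trajectory set of B is hard to obtain directly for realistic models.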

In view of what has been observed, if X supports Y, then the pair
(X, $\gamma$) is said to constitute a performability model of S. If B is a measurable set
of accomplishment levels, the inverse image $\gamma^{-1}(B)$ is referred to as the
trajectory set of B where its determination requires an analysis of how levels in
B relate back down via $\gamma^{-1}$ to trajectories of the base model. In the following
discussions, since we will be dealing with $\hat{X}$ instead of X, the induced process $\hat{X}$
will be called the base model and denoted simply as X.

Given a performability model (X, $\gamma$), equation (2.13) permits us to
evaluate the performability of S for a set B of accomplishment levels by i)
determining the trajectory set $\gamma^{-1}(B)$ and ii) calculating $\mathrm{Pr}[\gamma^{-1}(B)]$. In general,
the trajectory set $\gamma^{-1}(B)$ is difficult to obtain because the "distance" between the
base model X and the performance variable Y may be considerable. The
difficulty can be alleviated by introducing intermediate models between X and
Y based on the concept of a model hierarchy discussed in Section 2.1. The use
of a model hierarchy allows the capability function or, more accurately, the
trajectory set $\gamma^{-1}(B)$ to be derived step-by-step in a top-down manner from
more elementary components in a clearly conceived way. In particular, by
introducing an intermediate model called an "operational model," we show in
the following chapter that the performability of S can be determined by
evaluating the intermediate model.

Finally, we note that the role of a capability function in 
performability evaluation is similar to that of a structure function [18] in 
reliability evaluation. However, even when performability is restricted to 
reliability, the concept of a capability function is still more general because a 
capability function must take into account the behavior of S throughout the 
utilization period while a structure function is restricted to modeling the 
instantaneous behavior of S at a given moment in time [2].



CHAPTER 3 


OPERATIONAL MODELS 


3.1 Introduction

When modeling degradable computing systems by stochastic 
processes for system performance or reliability evaluation, the models used are 
typically Markov processes (see [8], for example) or models which can be 
analyzed in terms of embedded Markov processes (for example, certain 
queueing models such as M/G/1 or GI/M/m queues; see [21]). However, to
ensure the validity of the Markov assumption, it is usually necessary to model 
the structure and behavior of the system at a low level, e.g., a level describing 
the system’s physical resources (processing units, memory units, input buffers, 
etc.). Performance and reliability measures, on the other hand, often quantify 
the system’s behavior in terms of high-level, user-oriented variables 
(throughput, response time, operational status, etc.) which, if viewed as 
stochastic processes, are seldom Markovian. In such cases, an essential part of 
the modeling effort is to establish a "connection" between the low and high 
levels to resolve the probabilistic nature of the measure in question. 

Historically, in the context of reliability modeling, this connection 
has taken a form that lies at one of two extremes. At one extreme, system 
"success" is defined in terms of the underlying structural resources (at least so 
many fault-free processors, at least so many fault-free memory units, etc.), in
which case the connection between structure (available fault-free resources) 
and performance (success or failure) is immediate. At the other extreme, the 
object of the modeling effort is the connection, per se, and the resulting model 
is typically some form of event-tree or fault-tree (see [18], for example).

As discussed in the previous chapter, the general nature
of this connection can be formalized as a capability function of the system. In
this setting, a total system S, comprising a computing system and its
computational environment, is modeled at a low level by a stochastic process X
(the base model of S). Then, relative to a high level variable Y (the
performance of S), the capability function of S is a function $\gamma$ which translates
state trajectories (sample paths) of the process X into corresponding values of
the performance variable Y. Knowing X and $\gamma$, it is possible to solve for the
probability distribution function of Y and, hence, determine the performability 
of S. 

When the performance variable Y is far removed from the base 
model X, solution procedures can be simplified by introducing intermediate
models at levels between X and Y. One use of such a model hierarchy is a
step-by-step formulation of the preimage of $\gamma$ beginning at Y and terminating
at the base model X. If Y is discrete, the performability of S can then be
evaluated by determining the probabilities of certain trajectory sets that
correspond (under $\gamma^{-1}$) to performance values of Y. Another role that can be
played by an intermediate model, and the one we explore in this chapter, is to
represent the probabilistic nature of S at a level that is higher than the base 
model and thus "closer" to the performance variable. 

To characterize the behavior of an intermediate model, Section 3.2
introduces a general notion of recoverability and shows that a performance 
process is nonrecoverable if and only if the state behavior of the process can be 
determined by taking a "snapshot" at the end of the utilization period. For both 
the recoverable and nonrecoverable models, Section 3.3 examines the solution 
methods of a generally defined performance variable where the performance is 
identified with the minimum value of a functional. The modeling and the
solution methods are then illustrated in Section 3.4 through the evaluation of a 
degradable computing system. The results of the evaluation indicate that the 
performance variable considered in this chapter is a proper generalization of the 
traditional notions of the system performance and reliability. The modeling and 
the evaluation methods considered thus provide a unifying approach for 
evaluating the integrated performance and the reliability of degradable 
computing systems. 

3.2 Recoverability 

Generally, in reliability modeling, a system is said to be repairable or 
nonrepairable according to whether maintenance actions are permitted during 
its utilization to reduce the incidence of system failure or to return a failed 
system to an operating state. The classification is useful because, when a 
system is nonrepairable, the computation of system reliability at a time t
amounts to calculating the probability that the system functions at that moment 
in time (see [18], for example). On the other hand, when a system is 
repairable, the computation requires a more complete knowledge of the 
system’s behavior during the entire utilization period T. 

The above classification and properties of reliability models can be 
extended to models of degradable computing systems by considering the way in 
which system performance may change in time. The generalization not only 
permits us to obtain a better understanding of the performance degradation of a 
degradable computing system, but also provides us with a common basis for 
unifying traditional performance and reliability methods. 

To begin, let us define an operational model to be a stochastic process

    $Z = \{Z_t \mid t \in T\}$

with $Z_t: \Omega \to Q$ such that the state space Q of Z is partially ordered by some
partial ordering $\le$. The partial ordering $\le$ can be interpreted as the ranking of
system states according to the degree of user satisfaction with the system 
operating in a given state (hence the term "operational”). Although operational 
models are introduced here to characterize intermediate-level models, it should 
be noted that operational models can often be defined at the base model level 
with some natural ordering of states. For instance, consider a system S 
containing m subsystems where each of them can be in one of two operational
conditions: functioning or failed. Then, one natural way to define a state space
for S is by taking

    $Q = \{0,1\}^m$    (3.1)

and, assuming no compensating effects of successive failures, the state space
can be ordered by taking the Cartesian product of the component ordering
relations, i.e., for all $(a_1, a_2, \ldots, a_m)$ and $(b_1, b_2, \ldots, b_m)$ in Q, let

    $(a_1, a_2, \ldots, a_m) \le (b_1, b_2, \ldots, b_m)$ if and only if $a_i \le b_i$ for all $1 \le i \le m$.    (3.2)


The above ordering of component states is a standard practice in reliability 
theory and plays an important role in fault-tree analysis (see [18], for example). 
The applicability of operational models in modeling degradable computing 
systems will be discussed in more detail in the next section. 
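
As a minimal sketch of the componentwise ordering (3.2), the following Python fragment enumerates $Q = \{0,1\}^m$ for a small m and tests whether two states are comparable. The helper names `leq` and `comparable` and the choice m = 3 are assumptions made purely for illustration; they do not appear in the text.

```python
from itertools import product

def leq(a, b):
    """Componentwise order of (3.2): a <= b iff a_i <= b_i for every i."""
    return all(x <= y for x, y in zip(a, b))

def comparable(a, b):
    """Two states are comparable if one dominates the other componentwise."""
    return leq(a, b) or leq(b, a)

m = 3  # assumed number of subsystems (illustrative only)
Q = list(product((0, 1), repeat=m))

# Unordered pairs of distinct states that the partial order cannot rank.
# The lexicographic test a < b merely avoids listing each pair twice.
incomparable_pairs = [(a, b) for a in Q for b in Q
                      if a < b and not comparable(a, b)]
```

The ordering is only partial: for example, (1,0,0) and (0,1,0) each have one failed subsystem that the other does not, so neither state ranks above the other.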

Given an operational model Z, the concept of repairability can be
extended as follows. We say Z is nonrecoverable if, for all $s,t \in T$ ($s \le t$) and all
$i,j \in Q$,

    $\Pr[Z_s = i, Z_t = j] > 0$ implies $i \ge j$.

In less formal terms, an operational model is nonrecoverable iff its operational
status can only degrade monotonically in time. Moreover, by considering the
contrapositive form of the above condition, it follows that Z is nonrecoverable
iff, for all $s,t \in T$ ($s \le t$) and all $i,j \in Q$,

    $i \not\ge j$ implies $\Pr[Z_s = i, Z_t = j] = 0$.    (3.3)

Similarly, we say Z is recoverable if it is not nonrecoverable, i.e., if there exist
$s,t \in T$ ($s \le t$) and $i,j \in Q$ such that

    $i \not\ge j$ and $\Pr[Z_s = i, Z_t = j] > 0$.    (3.4)

In other words, Z is recoverable if there is a nonzero probability that the state
of the system may "recover" from a degraded state i to a higher level state j
($j > i$) or to a noncomparable state j ($j \not\le i$ and $j \not\ge i$).

The notion of nonrecoverability can be characterized in a number of 
useful ways, as indicated by the following theorem. 

Theorem 3.1:

Let Z be an operational model with state space Q. Then the
following statements are equivalent:

(1) Z is nonrecoverable.

(2) For all $s,t \in T$ ($s \le t$) and all $k \in Q$,

    $\Pr[Z_s \not\ge k, Z_t = k] = 0$.

(3) For all $s,t \in T$ ($s \le t$) and all $k \in Q$,

    $\Pr[Z_s \not\ge k, Z_t \ge k] = 0$.

(4) For all $s,t \in T$ ($s \le t$) and all $k \in Q$,

    $\Pr[Z_s \ge k, Z_t \ge k] = \Pr[Z_t \ge k]$.


Proof:

(1) implies (2):

Suppose that Z is nonrecoverable. By (3.3), for all $s,t \in T$ ($s \le t$) and
all $i,j \in Q$,

    $i \not\ge j$ implies $\Pr[Z_s = i, Z_t = j] = 0$.

Since Q is denumerable, we then have, for all $k \in Q$,

    $\Pr[Z_s \not\ge k, Z_t = k] = \sum_{i \not\ge k} \Pr[Z_s = i, Z_t = k] = 0$.

(2) implies (3):

Suppose that, for all $s,t \in T$ ($s \le t$) and all $j \in Q$,

    $\Pr[Z_s \not\ge j, Z_t = j] = 0$.

Then, since Q is denumerable,

    $\Pr[Z_s \not\ge k, Z_t \ge k] = \sum_{j \ge k} \Pr[Z_s \not\ge k, Z_t = j] \le \sum_{j \ge k} \Pr[Z_s \not\ge j, Z_t = j] = 0$,

where the inequality holds because $j \ge k$ and $Z_s \not\ge k$ together imply $Z_s \not\ge j$.


(3) implies (4):

Note first that, for all $s,t \in T$ ($s \le t$) and all $k \in Q$,

    $\Pr[Z_t \ge k] = \Pr[Z_s \ge k, Z_t \ge k] + \Pr[Z_s \not\ge k, Z_t \ge k]$.

Hence, for all $s,t \in T$ ($s \le t$) and all $k \in Q$,

    $\Pr[Z_s \not\ge k, Z_t \ge k] = 0$ if and only if $\Pr[Z_s \ge k, Z_t \ge k] = \Pr[Z_t \ge k]$.    (3.5)

Thus (3) and (4) are equivalent and, in particular, (3) implies (4).

(4) implies (1):

From (3.5), when condition (4) is satisfied, we have, for all
$s,t \in T$ ($s \le t$) and all $j \in Q$,

    $\Pr[Z_s \not\ge j, Z_t \ge j] = 0$.

Thus, $i \not\ge j$ implies

    $\Pr[Z_s = i, Z_t = j] \le \Pr[Z_s \not\ge j, Z_t \ge j] = 0$;

in particular, it implies $\Pr[Z_s = i, Z_t = j] = 0$, i.e., as characterized in (3.3), Z is
nonrecoverable.

This circle of implications thus completes the proof of Theorem 3.1. 

An alternative way to characterize the recoverability of an 
operational model is by examining the state behavior of the model over the 
entire utilization period. In this regard, let us restrict our attention to 
operational models that are separable in the sense defined in [17], i.e., there
exists a denumerable subset R of T and an event A of probability 0 such that,
for any closed interval B and any open interval I in $(-\infty, +\infty)$, we have

    $\bigcap_{s \in IR} \{Z_s \in B\} - \bigcap_{s \in IT} \{Z_s \in B\} \subset A$    (3.6)

where $IR = I \cap R$ and $IT = I \cap T$. The set R is referred to as a separability
set.

When Z is separable, we are able to show that Z is nonrecoverable if 
and only if its state behavior over any time interval can be summarized by 
observing the state of Z at the end of the interval. 

Theorem 3.2:

Suppose Z is a separable operational model. Then Z is
nonrecoverable if and only if, for all $r,t \in T$ ($r \le t$) and all $k \in Q$,

    $\Pr[Z_s \ge k, r \le s \le t] = \Pr[Z_t \ge k]$.    (3.7)


Proof:

Suppose Z is nonrecoverable and, given a state $k \in Q$, let

    $E_s = \{\omega \mid Z_s(\omega) \ge k\}$  ($s \in T$).

Then, in terms of this notation, (3.7) says

    $\Pr[\bigcap_{s \in [r,t]} E_s] = \Pr[E_t]$.    (3.7)'

Furthermore, let us denote the intersection of two sets A and B by AB. Since
Z is separable, there exists a denumerable subset R of T and a null event A
such that, for all $r,t \in T$,

    $\bigcap_{s \in (r,t)R} E_s - \bigcap_{s \in (r,t)} E_s \subset A$.

Accordingly,

    $E_r E_t \left[ \bigcap_{s \in (r,t)R} E_s - \bigcap_{s \in (r,t)} E_s \right] \subset A$

and, hence,

    $\bigcap_{s \in [r,t]R'} E_s \subset \left( \bigcap_{s \in [r,t]} E_s \right) \cup A$    (3.8)

where $R' = R \cup \{r,t\}$.

Clearly, the first set in (3.8) is measurable because $[r,t]R'$ is
denumerable. Thus, under the separability hypothesis, the second set (which is
contained in the first) is likewise measurable and has the same probability
(assuming the probability measure Pr is complete; see Chapter 2, p. 16). More
precisely, for all $r,t \in T$ ($r \le t$),

    $\Pr[\bigcap_{s \in [r,t]} E_s] = \Pr[\bigcap_{s \in [r,t]R'} E_s]$.    (3.9)

Hence, if we can show that the probability on the right side of (3.9) equals
$\Pr[E_t]$, we establish the desired result, i.e., (3.7)'.

If we denote

    $D = \bigcap_{s \in [r,t]R'} E_s$,

then, since $t \in [r,t]R'$, we have $D = DE_t$. Thus it suffices to show that

    $\Pr[DE_t] = \Pr[E_t]$

or equivalently,

    $\Pr[\bar{D}E_t] = 0$.    (3.10)


Next, note that

    $\bar{D}E_t = \left[ \bigcup_{s \in [r,t]R'} \bar{E}_s \right] E_t = \bigcup_{s \in [r,t]R'} \bar{E}_s E_t$.

Accordingly, we have

    $\Pr[\bar{D}E_t] \le \sum_{s \in [r,t]R'} \Pr[\bar{E}_s E_t]$

or equivalently (in our original notation),

    $\Pr[\bar{D}E_t] \le \sum_{s \in [r,t]R'} \Pr[Z_s \not\ge k, Z_t \ge k]$.

Since Z is nonrecoverable, by Theorem 3.1 (condition 4), each term on the
right side of the above inequality is zero, whence

    $\Pr[\bar{D}E_t] = 0$.

This proves (3.10) and thus establishes the necessity of (3.7).

To prove that (3.7) is sufficient, suppose (3.7) holds. Then, if we
let $s,t \in T$ ($s \le t$) and let $k \in Q$, by Theorem 3.1, it suffices to show that

    $\Pr[Z_s \ge k, Z_t \ge k] = \Pr[Z_t \ge k]$

or, equivalently, using the notation introduced above,

    $\Pr[E_s E_t] = \Pr[E_t]$.    (3.11)

The above equality is trivially true when $s = t$, so let us suppose $s < t$. By (3.7),
it follows that

    $\Pr[\bigcap_{u \in [s,t]} E_u] = \Pr[E_t]$.

Then, since

    $\Pr[E_s E_t] \le \Pr[E_t]$

and

    $\Pr[E_s E_t] \ge \Pr[\bigcap_{u \in [s,t]} E_u] = \Pr[E_t]$,

it follows that

    $\Pr[E_s E_t] = \Pr[E_t]$,

which establishes the sufficiency of (3.7).
Using Theorem 3.2 and the fact that Q is denumerable,
nonrecoverability can also be characterized by each of the following alternative
conditions (the proofs are immediate and are omitted):

(1) For all $r,t \in T$ ($r \le t$) and all $k \in Q$,

    $\Pr[Z_s > k, r \le s \le t] = \Pr[Z_t > k]$.    (3.12)

(2) For all $r,t \in T$ ($r \le t$) and all $k \in Q$,

    $\Pr[Z_s \ge k, r \le s \le t, Z_t = k] = \Pr[Z_t = k]$.    (3.13)

Theorem 3.2 provides us with a convenient way for relating the 
concept of recoverability to the traditional notion of repairability. To see this, 
let us define the level-k reliability of Z at time t to be

    $R_k(t) = \Pr[Z_s \ge k, 0 \le s \le t]$    (3.14)

and define the level-k availability of Z at time t to be

    $A_k(t) = \Pr[Z_t \ge k]$.    (3.15)

Clearly, when $Q = \{0,1\}$ where 0 = failure and 1 = success, $R_1(t)$ and $A_1(t)$
reduce to the usual notions of system reliability and system availability,
respectively. Moreover, when Z is nonrecoverable, Theorem 3.2 implies that,
for all $t \in T$ and all $k \in Q$,

    $A_k(t) = R_k(t)$.    (3.16)


In other words, when Z is nonrecoverable, its level-k reliability reduces to
the level-k availability for all $k \in Q$. The significance of this observation is that,
when Z is nonrecoverable, the calculation of the level-k reliability at a time t
amounts to calculating the probability that the system operates at a level greater
than or equal to k at that particular moment in time. On the other hand, let
$T = [0,h]$ and suppose (3.16) holds. Then, since, for all $r,t \in T$ ($r \le t$) and all
$k \in Q$,

    $\Pr[Z_s \ge k, 0 \le s \le t] \le \Pr[Z_s \ge k, r \le s \le t] \le \Pr[Z_t \ge k]$,

we have that $A_k(t) = R_k(t)$ implies

    $\Pr[Z_s \ge k, r \le s \le t] = \Pr[Z_t \ge k]$,

i.e., condition (3.16) is necessary and sufficient for nonrecoverability. Thus, by
taking the negation of the above condition, we also have the following 
alternative characterization of recoverability: 

Theorem 3.3:

Let Z be a separable operational model with state space Q. Then Z
is recoverable if and only if there exist $t \in T$ and $k \in Q$ such that

    $A_k(t) > R_k(t)$.    (3.17)

Roughly speaking, the above theorem says that Z is recoverable if and only if,
for some $k \in Q$, the level-k availability is subject to improvement by
maintenance actions.
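
To make the contrast between (3.14) and (3.15) concrete, consider a hedged numerical sketch for the binary case $Q = \{0,1\}$. It assumes a standard two-state Markov model that starts in state 1, fails at a constant rate `lam` and is repaired at a constant rate `mu`; this model, the closed-form expressions used for it, and the function names are assumptions introduced here for illustration and are not taken from the text.

```python
import math

def reliability(lam, t):
    """R_1(t): probability of staying in state 1 throughout [0, t]."""
    return math.exp(-lam * t)

def availability(lam, mu, t):
    """A_1(t): probability of being in state 1 at time t (start in state 1)."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

lam, t = 0.1, 5.0

# Without repair (mu = 0) the model is nonrecoverable and A_1(t) = R_1(t).
gap_no_repair = availability(lam, 0.0, t) - reliability(lam, t)

# With repair (mu > 0) the model is recoverable and A_1(t) > R_1(t).
gap_with_repair = availability(lam, 1.0, t) - reliability(lam, t)
```

Under these assumptions the gap vanishes exactly when repair is impossible, which is the behavior described by (3.16) and (3.17).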

Theorems 3.2 and 3.3 not only provide us with a useful tool for 
characterizing the behavior of operational models, they also provide us with a 
basis for evaluating the performability of degradable computing systems. Each 
of equations (3.14) and (3.15) defines an important class of performance 
measures that are proper generalizations of the traditional notions of system 
reliability and system availability. When the operational model is 
nonrecoverable, both classes convey the same information and the behavior of 
the system can be determined by taking a "snapshot" at the end of the 
utilization period. Motivated by the above properties of an operational model, 
we consider in the following section a single user-oriented performance variable 
that integrates these notions of system reliability and availability. 

3.3 Evaluation of Computing Systems Using Functionals of a Markov Process 

When describing system behavior in user-oriented terms, it is often 
possible to identify various operational "modes" for the system (including a
failure mode) which result in different degrees of user satisfaction. Moreover,
for a given mode of operation, the extent of user satisfaction can often be
quantified as a real number "rate" at which that operation benefits or penalizes
the user. Depending on the application, these rates can have a variety of 
interpretations relating to the system's productivity, responsiveness, etc., or at a 
higher level, to such things as economic benefit (e.g., the worth rate measured, 
say, in dollars/unit time) associated with a given mode of operation. 

Under the above conditions, a user-oriented model can be 
constructed in a natural way. As in the previous discussions, let S denote the 
total system in question and suppose that we have already determined a base 
model X and a capability function $\gamma$ relative to some specified performance
variable Y. Suppose further that the base model process X is defined relative
to a continuous time interval T (the utilization period), that is,

    $X = \{X_t \mid t \in T\}$    (3.18)

where the random variables $X_t$ take values in a denumerable state space Q (see
(2.1) for the definition of a base model). Finally, we presume that at the base 
level, the system model is Markovian with a time-invariant structure, that is, X 
is a continuous-time time-homogeneous Markov process. Unless otherwise 
specified, it will be assumed that Q is countably infinite throughout the 
following discussions. 


Within this framework, let us now consider the situation discussed 
above where, at a higher level, one is able to identify various operational
modes for S, each having an associated operational rate. If, further, each state
of the base model can be classified according to some mode of operation, then
there is a naturally defined real-valued function

    $f: Q \to R$    (3.19)

where, for each $i \in Q$, f(i) is the operational rate associated with the mode
containing i. Moreover, if we let $\bar{Q}$ denote the range of f (i.e., $\bar{Q} = \{f(i) \mid i \in Q\}$)
and, for each variable $X_t$ of X (see (3.18)), we let

    $Z_t = f(X_t)$,    (3.20)

it follows that

    $Z = \{Z_t \mid t \in T\}$    (3.21)

is a stochastic process with state space $\bar{Q}$, referred to generally as a functional of
the underlying Markov process X (see [22], for example).

When f is not 1-1 (i.e., some different states have the same mode of 
operation), the derived process Z will typically represent a simpler, higher level 
view of the system and is generally non-Markovian unless certain stringent 
conditions are satisfied. (Conditions under which the derived processes become 
Markovian are discussed in the Appendix.) To qualify Z as an intermediate
model, we must also require that Z be compatible with the performance
variable Y to the extent that the probability distribution function of Y can be
determined from Z. More precisely, letting $\kappa$ denote the translation of
trajectories of X to trajectories of Z (i.e., $\kappa(u) = \bar{u}$ where $\bar{u}(t) = f(u(t))$, for all
$t \in T$), there must exist a capability function $\bar{\gamma}$ for Z such that

    $\bar{\gamma} \circ \kappa = \gamma$    (3.22)

where $\circ$ denotes functional composition, first applying $\kappa$. Although the above
condition appears somewhat formidable, it says simply that the higher level 
model Z must remain detailed enough to permit solution of the system's 
performability. This condition can be typically satisfied in practice if the 
definition of performance (i.e., Y) is taken into account when identifying the 
various modes of operation and assigning rates to these modes. 

If f, as defined in (3.19), satisfies condition (3.22) then we refer to f 
as an operational structure of S and, since states inherit the rates assigned to
modes, the value f(i) is referred to as the operational rate of i or, when context
permits, simply the "rate of i." Likewise, the corresponding functional Z is
referred to as an operational model of S or, alternatively, a model of S at the 
operational level. 


In reliability modeling where, at the operational level, a system is 
typically viewed as either operating or not operating, the concept of an
operational structure reduces to the familiar notion of a structure function [18].
Technically, a function $f: Q \to R$ is a structure function if Q has binary
coordinates, i.e., $Q = \{0,1\}^m$, and f(i) is 1 or 0 according as S is operating or not
operating in state i. More recently, operational structures have been employed
at least implicitly in the context of performance-reliability modeling where the
operational rates are referred to as computational capacities [9], [12]. Although
capacity (which typically refers to the maximum rate at which a computer can
"supply" computations) is a legitimate interpretation of operational rate, it
should be emphasized that, in general, such rates can represent an interaction
of supply (by the computer) and demand (from the environment). This is
because, as generally conceived, a state i of the base model represents a
particular status of both the computer and its environment; hence, both supply
and demand can be accounted for when translating i, via f, to its corresponding
operational rate f(i).

In various special forms, then, the concept of an operational 
structure is no stranger to performance and reliability modeling. On the other 
hand, the general nature of the associated functional Z, how it relates to the base
model, how it can be exploited in solution procedures, etc., appear to be 
subjects that deserve further investigation. 

In the following discussions, we focus our investigation on the 
evaluation of performability with respect to a generally defined performance 
variable. This variable is defined in terms of an arbitrary operational model
which is generally non-Markovian. However, by relating this variable to the
underlying Markov process, it is shown that system performability can still be
evaluated using traditional Markov process methods. The performance variable
is motivated by the level-q reliability

    $R_q(t) = \Pr[f(X_s) \ge q, 0 \le s \le t]$

(where q is an operational rate, i.e., an element of the range of f) discussed in Section 3.2.

Recall that, by (3.16), the operational model $Z = \{Z_t \mid t \in T\}$ is
nonrecoverable if and only if, for all $t \in T$, $Z_t$ is the "worst case" rate experienced
by Z during [0,t]. On the other hand, if Z is recoverable, it was shown in
Theorem 3.3 that the operational rate at the end of the utilization period (i.e., the
value $Z_t$) will generally not convey the worst case rate. Motivated by the above
considerations, a performance variable $Y_t$, indicating the worst case operational
rate during [0,t], can be defined on Z as follows:

    $Y_t = \min\{Z_s \mid 0 \le s \le t\}$.    (3.23)

As defined above, we note first that Y t is a discrete performance 
variable since the base model X has a denumerable number of states and, 
hence, there are a denumerable number of operational rates. Therefore the 
performability $perf_S$ of S (see (2.3)) is simply the probability distribution of $Y_t$,
i.e.,

    $perf_S(q) = \Pr[Y_t = q]$.    (3.24)
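
The definition (3.23) can be illustrated on a single sample trajectory. The sketch below assumes that a trajectory of the base model is recorded as a time-ordered list of (entry time, state) pairs and that the operational structure is given as a dictionary; the representation, the sample path and the name `worst_case_rate` are hypothetical choices made only for this example.

```python
def worst_case_rate(path, f, t):
    """Y_t = min{ f(X_s) : 0 <= s <= t } for a piecewise-constant path.

    `path` lists (entry_time, state) pairs in increasing time order,
    with the first entry at time 0.
    """
    return min(f[state] for start, state in path if start <= t)

# Hypothetical operational structure and sample path (illustrative only).
f = {2: 2, 1: 1, 0: 0}
path = [(0.0, 2), (1.5, 1), (2.0, 2), (4.0, 0)]

y1 = worst_case_rate(path, f, 1.0)  # still in the initial state
y3 = worst_case_rate(path, f, 3.0)  # state has recovered to 2 by time 3
y5 = worst_case_rate(path, f, 5.0)  # state 0 entered at time 4
```

At time 3 the path has returned to state 2, yet the worst case rate over [0,3] remains 1. This is the situation in which $Y_t$ and $Z_t$ differ, which can occur only for a recoverable model.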

Before attempting to solve the performability of S, let us consider 
the recoverability of Z in more detail. Since the underlying base model is a 
time-homogeneous Markov process, significant insights can be obtained 
regarding the relationship between Z and X by expressing the recoverability of 
Z in terms of the probabilistic nature of X. 

In this regard, let us restrict our attention to Markov processes X 
which are regular in the sense that their transition probabilities are uniquely 
determined by a generator matrix or, equivalently, a state-transition diagram
(see [23], for example). Moreover, borrowing the terminology from [22], we
say that i leads to j (where $i,j \in Q$) and write $i \to j$ if and only if there exists a $t > 0$
in T such that

    $\Pr[X_t = j \mid X_0 = i] > 0$.


Then, it can be shown that

Theorem 3.4:

Let Z be an operational model associated with the base model X and
the operational structure f. Furthermore, let X be a time-homogeneous
Markov process. Then, Z is recoverable if and only if, for some $s \in T$ and some
$i,j \in Q$, both of the following conditions are satisfied:

    (1) $\Pr[X_s = i] > 0$
    (2) $i \to j$ and $f(i) < f(j)$.    (3.25)


Proof:

By (3.4), Z is recoverable if and only if there exist $s,t \in T$ ($s \le t$) and
operational rates q and r (elements of the range of f) such that

    $q < r$ and $\Pr[Z_s = q, Z_t = r] > 0$.    (3.26)

(Here, since the range of f is a totally ordered set, we are able to replace $\not\ge$ with $<$ in
(3.4).) Now, since Q is denumerable,

    $\Pr[Z_s = q, Z_t = r] = \sum_{f(i) = q,\, f(j) = r} \Pr[X_s = i, X_t = j]$,

and, hence, equation (3.26) holds if and only if, for some $i,j \in Q$,

    $f(i) = q$, $f(j) = r$ and $\Pr[X_s = i, X_t = j] > 0$.    (3.27)

Moreover, since

    $\Pr[X_s = i, X_t = j] > 0$ if and only if $\Pr[X_s = i] > 0$ and $i \to j$,    (3.28)

it follows that the conditions stated in (3.25) are necessary and sufficient for Z
to be recoverable.

By taking the negation of (3.25), a similar result can also be derived
to characterize the nonrecoverability of Z as follows:

Corollary:

Let Z be an operational model associated with the base model X and
the operational structure f. Moreover, let X be a time-homogeneous Markov
process. Then, Z is nonrecoverable if and only if, for all $i,j \in Q$ and all $s \in T$, at
least one of the following conditions is satisfied:

    (1) $\Pr[X_s = i] = 0$
    (2) $i \to j$ implies $f(i) \ge f(j)$.    (3.29)
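
The test in (3.25) reduces to a reachability question on the state-transition diagram. The following sketch assumes that the diagram is given as a dictionary mapping each state to the set of states it can enter directly (those with a positive transition rate), that the initial distribution is summarized by its support, and that "leads to" coincides with graph reachability. The function names and the three small graphs are hypothetical and serve only as illustrations.

```python
def reachable(graph, start):
    """States j with start -> j, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_recoverable(graph, initial, f):
    """Check (3.25): some occupied state i leads to a j with f(i) < f(j)."""
    occupied = set()  # states i with Pr[X_s = i] > 0 for some s
    for i0 in initial:
        occupied |= reachable(graph, i0)
    return any(f[i] < f[j] for i in occupied for j in reachable(graph, i))

f = {0: 0, 1: 1, 2: 2}
with_repair = {2: {1}, 1: {0, 2}, 0: set()}  # state 1 can return to state 2
degrading = {2: {1}, 1: {0}, 0: set()}       # rates can only decrease

r1 = is_recoverable(with_repair, {2}, f)
r2 = is_recoverable(degrading, {2}, f)
# An improving transition 0 -> 1 exists, but state 0 is never occupied.
r3 = is_recoverable({0: {1}, 1: set()}, {1}, {0: 0, 1: 1})
```

The third case shows why condition (1) matters: an improving transition out of a state that is never occupied does not make Z recoverable.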


Note that if we assume Q is the minimal state space of X in the sense
defined in [22], i.e., for all $i \in Q$, there exists an s in T such that $\Pr[X_s = i] > 0$,
then the first conditions in (3.25) and (3.29) can both be eliminated. To show
this, we first observe that

    $\Pr[X_s = i] > 0$ if and only if $\Pr[X_0 = k] > 0$ and $\Pr[X_s = i \mid X_0 = k] > 0$

for some $k \in Q$. Now since X is regular, the transition probability
$\Pr[X_t = i \mid X_0 = k]$ as a function of t vanishes either everywhere or nowhere in
T (see [23], p. 240). Thus, it must be the case that $\Pr[X_t = i \mid X_0 = k] > 0$ for all
$t \in T$. Clearly, it then follows that

    $\Pr[X_t = i] \ge \Pr[X_0 = k] \Pr[X_t = i \mid X_0 = k] > 0$

for all $t \in T$.

The recoverability of Z can also be characterized in terms of
partitions induced by f on the state space Q. Note first that the binary relation
$\to$ induces an equivalence relation on Q as follows (see [22], for example): we
say i communicates with j (denoted $i \leftrightarrow j$) if and only if $i \to j$ and $j \to i$. Let $[i]_c$
be the communicating class containing i. Then the partition of Q induced by
the communication relation can be denoted as a set

    $\pi_c = \{[i]_c \mid i \in Q\}$.    (3.30)

Note also that the operational structure $f: Q \to R$ induces another partition on Q,

    $\pi_f = \{[i]_f \mid i \in Q\}$,    (3.31)

such that i and j belong to the same equivalence class if $f(i) = f(j)$. In terms of the
above partitions and assuming that X is a time-homogeneous Markov process
with minimal state space Q, we can then show that

Lemma:

If Z is nonrecoverable, then $\pi_c$ is finer than $\pi_f$ (denoted $\pi_c \le \pi_f$).

Proof:

Suppose that i and j belong to the same block in $\pi_c$. It must be the
case that $i \leftrightarrow j$, which, in turn, implies that $f(i) \ge f(j)$ and $f(j) \ge f(i)$ because Z is
nonrecoverable. Thus, it follows that $f(i) = f(j)$, i.e., i and j belong to the same
block in $\pi_f$.

The converse of the above lemma is generally not true. For 
example, let X be a Markov process with transition graph 

    [figure: transition graph of X]

and initial distribution $\Pr[X_0 = 1] = 1$. Suppose the operational structure $f: Q \to R$
is given by $f(1) = 0$ and $f(2) = f(3) = 1$. Then

    $\pi_c = \{\{1\},\{2,3\}\} = \pi_f$.

However, Z is recoverable, because $1 \to 2$ but $f(1) < f(2)$.

The above example also suggests a necessary and sufficient condition 
for Z to be a nonrecoverable model. Under the same assumptions as those for 
the above lemma, we first define a partial ordering of the set $\pi_c$: For all $[i]_c$
and $[j]_c$ in $\pi_c$, let

    $[i]_c \to [j]_c$ if $i \to j$.    (3.32)

Clearly, the partial ordering as a relation is reflexive, transitive and
antisymmetric. Furthermore, let us define a mapping

    $h: \pi_c \to R$    (3.33)

such that $h([i]_c) = f(i)$. Then,

Theorem 3.5:

Z is nonrecoverable if and only if h is well-defined and order-preserving.

Proof:

Suppose Z is nonrecoverable. Then, by the above lemma, $\pi_c \le \pi_f$.
In other words, for all i and j in Q, $i \leftrightarrow j$ implies $f(i) = f(j)$ and, hence,
$h([i]_c) = f(i)$ is well-defined. Moreover, suppose $[i]_c \to [j]_c$. Then, by (3.32),
we have $i \to j$ and, hence,

    $h([i]_c) = f(i) \ge f(j) = h([j]_c)$,

i.e., h is order-preserving.

Conversely, let us suppose that h is well-defined and order-preserving.
Then, for all $i,j \in Q$,

    $i \to j$ implies $[i]_c \to [j]_c$.

Now since h is well-defined, we have

    $h([i]_c) = f(i)$ and $h([j]_c) = f(j)$.

Applying the order-preserving assumption of h, it then follows that

    $f(i) = h([i]_c) \ge h([j]_c) = f(j)$,

i.e., Z is nonrecoverable.


Corollary:

If Z is nonrecoverable, then the following diagram commutes:

    [diagram: $f: Q \to R$ factors as $i \mapsto [i]_c$ followed by $h: \pi_c \to R$]


Theorem 3.5 permits us to determine whether an operational model 
is nonrecoverable by comparing the state diagram of the underlying Markov 
process with the operational structure. To illustrate, let us consider a base 
model X with the following state diagram 

    [figure: state diagram of X]

and initial distribution $\Pr[X_0 = 2] = 1$. Suppose the operational structure is given
by $f(2) = f(1) = 1$ and $f(0) = 0$. Then, $\pi_c = \{\{0\},\{1,2\}\} = \pi_f$, $\{1,2\} \to \{0\}$ and
$h(\{1,2\}) = 1 > h(\{0\}) = 0$. Thus, by Theorem 3.5, Z is nonrecoverable.

It should also be noted that a Markov process X may induce a
nonrecoverable model with respect to an operational structure but a recoverable
model with respect to other operational structures. For example, using the
same base model as above and the operational structure $g(2) = 2$, $g(1) = 1$ and
$g(0) = 0$, it is clear that $\pi_c \not\le \pi_g$ ($\pi_c$ is not finer than $\pi_g$). Hence, by the
lemma of Theorem 3.5, Z induced by g is a recoverable model.
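
The criterion of Theorem 3.5 can be checked mechanically. In the sketch below, the state diagram is an assumption: the original figure is not available, so a graph with transitions 2 -> 1, 1 -> 2 and 1 -> 0 is used because it is consistent with the communicating classes {0} and {1,2} and with {1,2} leading to {0}. The function names are likewise hypothetical.

```python
def reachable(graph, start):
    """States j with start -> j, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def check_theorem_3_5(graph, f):
    """Return (communicating classes, whether h is well-defined and order-preserving)."""
    reach = {i: reachable(graph, i) for i in graph}
    classes = {frozenset(j for j in reach[i] if i in reach[j]) for i in graph}
    # h([i]_c) = f(i) is well-defined iff f is constant on every class.
    if any(len({f[i] for i in c}) > 1 for c in classes):
        return classes, False
    cls = {i: c for c in classes for i in c}
    h = {c: f[next(iter(c))] for c in classes}
    # Order-preserving: [i]_c -> [j]_c implies h([i]_c) >= h([j]_c).
    ok = all(h[cls[i]] >= h[cls[j]] for i in graph for j in reach[i])
    return classes, ok

graph = {2: {1}, 1: {2, 0}, 0: set()}  # assumed diagram (see lead-in)
f = {2: 1, 1: 1, 0: 0}
g = {2: 2, 1: 1, 0: 0}

classes, f_nonrecoverable = check_theorem_3_5(graph, f)
_, g_nonrecoverable = check_theorem_3_5(graph, g)
```

With f the mapping h is well-defined and order-preserving, so Z is nonrecoverable; with g the class {1,2} carries two different rates, so h is not well-defined and the induced model is recoverable.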

Returning now to the problem of evaluating the performability of S 
(with respect to the performance variable Y t as defined by (3.23)), we consider 
the problem in two cases based on the recoverability of the operational model 
Z. If the operational model Z is nonrecoverable, Theorem 3.2 shows that the
behavior of Z during [0,t] can be determined by the state of Z at the time
instant t. In particular, we have

    $\Pr[Z_s \ge q, 0 \le s \le t] = \Pr[Z_t \ge q]$    (3.34)

and

    $\Pr[Z_s > q, 0 \le s \le t] = \Pr[Z_t > q]$.    (3.35)

Hence the performability of S (see (3.24)) can be obtained by evaluating the
finite dimensional distribution $\Pr[Z_t = q]$. More precisely, we have

    $perf_S(q) = \Pr[Y_t = q]$
               $= \Pr[\min\{Z_s \mid 0 \le s \le t\} = q]$
               $= \Pr[Z_s \ge q, 0 \le s \le t] - \Pr[Z_s > q, 0 \le s \le t]$
               $= \Pr[Z_t \ge q] - \Pr[Z_t > q]$
               $= \Pr[Z_t = q]$.    (3.36)

Thus, in this case, evaluating performability (i.e., determining the probability
distribution function of $Y_t$) is tantamount to evaluating the transition function
of X, i.e.,

    $perf_S(q) = \sum_{f(j) = q} \Pr[X_t = j] = \sum_{f(j) = q} \sum_{i \in Q} \Pr[X_t = j \mid X_0 = i] \Pr[X_0 = i]$.    (3.37)


On the other hand, if Z is recoverable, more elaborate solution methods are 
required since (3.36) is no longer satisfied. 
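
For a nonrecoverable model with a finite state space, (3.37) can be evaluated numerically once the state probabilities at time t are known. The sketch below is one possible realization under stated assumptions: the state probabilities are obtained by uniformization (a standard numerical technique that is not discussed in the text), the generator matrix is given as a list of rows, and the three-state degrading chain with rate `lam` is a hypothetical example.

```python
import math

def transient(gen, p0, t, terms=200):
    """Row vector p0 * exp(gen * t), computed by uniformization."""
    n = len(gen)
    rate = max(-gen[i][i] for i in range(n)) or 1.0
    # One-step matrix of the uniformized chain: I + gen / rate.
    step = [[(1.0 if i == j else 0.0) + gen[i][j] / rate for j in range(n)]
            for i in range(n)]
    vec = list(p0)
    weight = math.exp(-rate * t)  # Poisson weight for k = 0
    out = [weight * x for x in vec]
    for k in range(1, terms):
        vec = [sum(vec[i] * step[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        out = [o + weight * x for o, x in zip(out, vec)]
    return out

def perf_nonrecoverable(gen, p0, f, t):
    """(3.37): perf_S(q) is the sum of Pr[X_t = j] over states j with f(j) = q."""
    perf = {}
    for j, pj in enumerate(transient(gen, p0, t)):
        perf[f[j]] = perf.get(f[j], 0.0) + pj
    return perf

lam = 0.5                      # assumed degradation rate (illustrative)
gen = [[0.0, 0.0, 0.0],        # state 0 is absorbing
       [lam, -lam, 0.0],       # 1 -> 0
       [0.0, lam, -lam]]       # 2 -> 1
f = [0, 1, 1]                  # states 1 and 2 share the rate 1
perf = perf_nonrecoverable(gen, [0.0, 0.0, 1.0], f, t=2.0)
```

The series is truncated after a fixed number of terms, which is adequate only when the product of the uniformization rate and t is moderate; that limitation is a property of this sketch rather than of (3.37).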

When the operational model is recoverable, the performability $perf_S$
of S can be obtained by calculating the conditional probabilities

    $m^q_{ij}(t) = \Pr[Y_t = q, X_t = j \mid X_0 = i]$    (3.38)

where $i,j \in Q$, q is an operational rate (an element of the range of f) and $t \ge 0$. Note that, when $t = 0$, the minimum
operational rate is just the operational rate associated with the initial state; in
short,

    $m^q_{ij}(0) = 1$ if $i = j$ and $f(i) = q$, and $m^q_{ij}(0) = 0$ otherwise.    (3.39)

Then, by summing over some of the indices of $m^q_{ij}(t)$, we have

    $perf_S(q) = \sum_{i,j \in Q} m^q_{ij}(t) \, p_i$    (3.40)

where $p_i = \Pr[X_0 = i]$.

There are several ways of expressing the conditional probabilities
$m^q_{ij}(t)$ in terms of the state transition probabilities of the underlying Markovian
base model X [24]. First, let us introduce another stochastic process based on
X and the performance variable $Y_t$ as

    $\tilde{X} = \{(X_t, Y_t) \mid t \in T\}$.    (3.41)

Then, since, for $s \le t$,

    $Y_t = \min[\inf_{s \le u \le t} \{Z_u\}, Y_s]$

and since X is a Markov process, it can be shown that $\tilde{X}$ is a Markov process.

Clearly, the generator matrix of $\tilde{X}$ can be expressed in terms of the
state transition rates of the underlying Markov process X. For all $i,j \in Q$ ($i \ne j$),
let $\lambda_{ij}$ denote the transition rate of X from state i to state j. Then the generator
matrix of X is the $|Q| \times |Q|$ matrix

    $A = [a_{ij}]$

where, for all $i,j \in Q$,

    $a_{ij} = \lambda_{ij}$ if $i \ne j$, and $a_{ii} = -\sum_{k \ne i} \lambda_{ik}$.

If, further, we let the generator matrix of $\tilde{X}$ be denoted by

    $\tilde{A} = [\tilde{a}_{yz}]$

where $y = (k,r)$ and $z = (j,q)$ are pairs consisting of a state of X and an operational rate, then

    $\tilde{a}_{yz} = a_{kj}$ if (1) $f(k) \ge r$, $f(j) \ge q$ and $r = q$, or (2) $f(k) \ge r$, $f(j) = q$ and $r > q$;
    $\tilde{a}_{yz} = 0$ otherwise.    (3.43)


Accordingly, if we denote the transition function of $\tilde{X}$ by

    $p_{xz}(t) = \Pr[\tilde{X}_t = z \mid \tilde{X}_0 = x]$

where x and z are states of $\tilde{X}$, then the transition functions of $\tilde{X}$ can be expressed as the
system of differential equations (see [23], pp. 254-255, Theorem 4.5)

    $\frac{d}{dt} p_{xz}(t) = \sum_{y} p_{xy}(t) \, \tilde{a}_{yz}$,    (3.44)

where the sum ranges over all states y of $\tilde{X}$. Furthermore, notice that when $x = (i, f(i))$ and $z = (j,q)$,

    $p_{xz}(t) = \Pr[\tilde{X}_t = (j,q) \mid \tilde{X}_0 = (i, f(i))]$
              $= \Pr[X_t = j, Y_t = q \mid X_0 = i, Y_0 = f(i)]$
              $= m^q_{ij}(t)$.

Hence, by varying the value of y over the state space of $\tilde{X}$ and replacing $\tilde{a}_{yz}$ by (3.43), the
above system of differential equations (3.44) can be expressed more
conveniently as


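    $\frac{d}{dt} m^q_{ij}(t) = \sum_{f(k) \ge q} m^q_{ik}(t) \, a_{kj}$ if $f(i) \ge q$, $f(j) > q$;
    $\frac{d}{dt} m^q_{ij}(t) = \sum_{r \ge q} \sum_{f(k) \ge r} m^r_{ik}(t) \, a_{kj}$ if $f(i) \ge q$, $f(j) = q$;
    $\frac{d}{dt} m^q_{ij}(t) = 0$ otherwise.    (3.45)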


It was suggested [24] that the above equations can also be obtained by relating 
(3.38) to the notion of taboo probabilities [22]. 
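
The first solution method can be sketched for a finite state space by building the generator of the pair process (state of X, running minimum rate) according to (3.43) and then computing its state probabilities at time t. The code below is a minimal illustration under stated assumptions: only feasible pairs (k, r) with $f(k) \ge r$ are kept, the differential equations are solved by uniformization rather than by a method named in the text, and the two-state repairable example with rates `lam` and `mu` is hypothetical.

```python
import math

def transient(gen, p0, t, terms=200):
    """Row vector p0 * exp(gen * t), computed by uniformization."""
    n = len(gen)
    rate = max(-gen[i][i] for i in range(n)) or 1.0
    step = [[(1.0 if i == j else 0.0) + gen[i][j] / rate for j in range(n)]
            for i in range(n)]
    vec = list(p0)
    weight = math.exp(-rate * t)
    out = [weight * x for x in vec]
    for k in range(1, terms):
        vec = [sum(vec[i] * step[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        out = [o + weight * x for o, x in zip(out, vec)]
    return out

def augmented_generator(gen, f):
    """Generator of the pair process, following the two cases of (3.43)."""
    n = len(gen)
    levels = sorted(set(f))
    states = [(k, r) for k in range(n) for r in levels if f[k] >= r]
    index = {s: a for a, s in enumerate(states)}
    big = [[0.0] * len(states) for _ in states]
    for (k, r), a in index.items():
        for j in range(n):
            q = min(r, f[j])  # running minimum after moving from k to j
            big[a][index[(j, q)]] += gen[k][j]
    return states, big

def perf_via_augmented(gen, p0, f, t):
    """perf_S(q): probability that the running minimum equals q at time t."""
    states, big = augmented_generator(gen, f)
    start = [p0[k] if r == f[k] else 0.0 for (k, r) in states]
    perf = {}
    for (j, q), p in zip(states, transient(big, start, t)):
        perf[q] = perf.get(q, 0.0) + p
    return perf

lam, mu = 0.1, 1.0                 # assumed failure and repair rates
gen = [[-mu, mu], [lam, -lam]]     # state 0 = failed, state 1 = operating
perf = perf_via_augmented(gen, [0.0, 1.0], [0, 1], t=5.0)
```

In this example the probability that the worst case rate is 1 equals the probability of no failure during [0,5], even though repair makes the model recoverable.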

Another approach for determining the probability distribution of the 
random variable Y t is to modify the underlying Markov process X by making 
some of the states in Q absorbing [24]. More precisely, let us rename and
rearrange the operational rates (the elements of the range of f) into an increasing sequence $\{1,2,\ldots\}$. For each
operational rate q, let $B_q$ be a subset of Q such that

    $B_q = \{i \mid f(i) < q\}$

where $f: Q \to R$ is the operational structure of Z. We then replace the state
space of X by a reduced one in which a single state $b_q$ replaces the states $B_q$ and
denote the transformed process by

    $X^q = \{X^q_t \mid t \in T\}$.

Moreover, let us denote the corresponding generator matrix of $X^q$ by

    $A^q = [a^q_{ij}]$;

then, for all $i,j \in \{k \mid f(k) \ge q\} \cup \{b_q\}$,

    $a^q_{ij} = a_{ij}$ if $f(i) \ge q$, $f(j) \ge q$;
    $a^q_{ij} = \sum_{k \in B_q} a_{ik}$ if $i \ne b_q$, $j = b_q$;
    $a^q_{ij} = 0$ if $i = b_q$.    (3.46)


If, further, we let the transition functions of $X^q$ be denoted by

    $p^q_{ij}(t) = \Pr[X^q_t = j \mid X^q_0 = i]$    (3.47)

where $i,j \in \{k \mid f(k) \ge q\} \cup \{b_q\}$, then the conditional probabilities $m^q_{ij}(t)$ can be
expressed as follows: For all $i,j \in Q$ and all operational rates q, where $f(i) \ge q$ and $f(j) \ge q$,

    $m^q_{ij}(t) = \Pr[Y_t \ge q, X_t = j \mid X_0 = i] - \Pr[Y_t \ge q+1, X_t = j \mid X_0 = i]$
                $= \Pr[Z_s \ge q, 0 \le s \le t, X_t = j \mid X_0 = i] - \Pr[Z_s \ge q+1, 0 \le s \le t, X_t = j \mid X_0 = i]$
                $= \Pr[X^q_t = j \mid X^q_0 = i] - \Pr[X^{q+1}_t = j \mid X^{q+1}_0 = i]$
                $= p^q_{ij}(t) - p^{q+1}_{ij}(t)$.    (3.48)

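Hence, if we solve the transition probabilities $p^q_{ij}(t)$ for all operational rates q, the
performability of S can be computed by

    $perf_S(q) = \sum_{f(i),f(j) \ge q} p^q_{ij}(t) \, p_i - \sum_{f(i),f(j) \ge q+1} p^{q+1}_{ij}(t) \, p_i$
               $= \sum_{f(i) \ge q} [1 - p^q_{i b_q}(t)] \, p_i - \sum_{f(i) \ge q+1} [1 - p^{q+1}_{i b_{q+1}}(t)] \, p_i$    (3.49)

where $p_i = \Pr[X_0 = i]$ are the initial probabilities of X.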
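
The absorbing-state method can be sketched as follows for a finite state space. For each rate q, the states whose rate is below q are merged into one absorbing state, and the probability of not having been absorbed by time t is computed; differencing consecutive levels then gives the performability. The code is an illustration under stated assumptions: rates are not renamed to 1, 2, ... (the next higher rate plays the role of q+1), uniformization is used as the numerical solver, and the function names and the two-state example are hypothetical.

```python
import math

def transient(gen, p0, t, terms=200):
    """Row vector p0 * exp(gen * t), computed by uniformization."""
    n = len(gen)
    rate = max(-gen[i][i] for i in range(n)) or 1.0
    step = [[(1.0 if i == j else 0.0) + gen[i][j] / rate for j in range(n)]
            for i in range(n)]
    vec = list(p0)
    weight = math.exp(-rate * t)
    out = [weight * x for x in vec]
    for k in range(1, terms):
        vec = [sum(vec[i] * step[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        out = [o + weight * x for o, x in zip(out, vec)]
    return out

def prob_min_at_least(gen, p0, f, q, t):
    """Pr[Z_s >= q, 0 <= s <= t]: states with f < q become one absorbing state."""
    n = len(gen)
    keep = [i for i in range(n) if f[i] >= q]
    m = len(keep)  # index m is the absorbing state b_q
    reduced = [[0.0] * (m + 1) for _ in range(m + 1)]
    for a, i in enumerate(keep):
        for b, j in enumerate(keep):
            reduced[a][b] = gen[i][j]
        reduced[a][m] = sum(gen[i][k] for k in range(n) if f[k] < q)
    start = [p0[i] for i in keep]
    start.append(sum(p0[i] for i in range(n) if f[i] < q))
    return sum(transient(reduced, start, t)[:m])

def perf_via_absorption(gen, p0, f, t):
    """Difference of consecutive levels, in the spirit of (3.49)."""
    levels = sorted(set(f))
    tail = [prob_min_at_least(gen, p0, f, q, t) for q in levels] + [0.0]
    return {q: tail[a] - tail[a + 1] for a, q in enumerate(levels)}

lam, mu = 0.1, 1.0                 # assumed failure and repair rates
gen = [[-mu, mu], [lam, -lam]]     # state 0 = failed, state 1 = operating
perf = perf_via_absorption(gen, [0.0, 1.0], [0, 1], t=5.0)
```

Each level requires only a chain over the states with rate at least q plus one absorbing state, which reflects the decomposition into smaller subsystems that distinguishes this method from solving the full pair process.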

When the total system is modeled by a recoverable operational 
model, either (3.45) or (3.49) can be used to evaluate the performability of the 
system. The solution method described in (3.45) requires evaluating a large
system of differential equations. On the other hand, the solution method
described in (3.49) decomposes the system of differential equations of (3.45)
into smaller subsystems. Thus, evaluating (3.49) amounts to a step-by-step
iterative solution to (3.45). Finally, we also note that, in addition to the
applications illustrated in the following example and the next chapter, (3.45)
and (3.49) can also be used to compute the "intraphase transition probabilities"
considered in the context of phased models (see Section 5.4).
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3.4 Performability of a Triplicated Fault-Tolerant Computing System 

To illustrate the solution methods for the evaluation of system 
performabilities, let us consider a degradable fault-tolerant computing system 
wherein resources are triplicated and voted (triple modular redundancy). This 
type of resource redundancy has been employed in a variety of hardware 
architectures (see [25] for an example of current usage) and, in a more general 
form (N-modular redundancy), has been investigated as a means for 
implementing fault-tolerant software (see [26], for example). Although such a 
system will generally possess a number of triplicated resources (e.g., the various 
"triads" of the FTMP architecture [25]), let us restrict our attention to a single 
resource, say, a triplicated processor consisting of three identical processor 
modules and a voter. With respect to hardware faults, we assume that the 
processor modules fail independently and that each fails permanently with a 
constant failure rate $\lambda$ (failures/hr.). The system's ability to recover from a 
hardware fault in a processing module is accounted for by a coverage parameter 
c (see [27]). 

When the system is free of hardware faults, we further assume that 
it has some capability of recovering from errors due to design faults in the 
software. Such errors are presumed to occur at a constant rate $\sigma$ (errors/hr.). 
Transitions from the error state, with or without recovery, occur at a constant 
rate $\mu$ (transitions/hr.), which we assume to be much larger than the processor 
module failure rate $\lambda$. The probability of software error recovery, given a 
software error, is a constant d (the software error coverage parameter); error 
recovery thus occurs at a rate $d\mu$. Lack of recovery from a software error is 
presumed to cause a crash (system failure). If a processor module becomes 
faulty and the fault is tolerated via successful voting, the input-output behavior 
of the system remains the same. With this loss of a processing resource, 
however, we assume that the system is no longer capable of software error 
recovery. Hence any further errors, due to a software fault or a second faulty 
processor module, result in failure of the system. 

Under the above assumptions, the system can be conveniently 
represented by a 4-state Markovian base model, where the states are 
interpreted as follows: 


  State    Processor Fault    Software Error
  1        No                 No
  2        Yes                No
  3        No                 Yes
  4        System failure




Figure 3.1: Markov Model of a TMR System with Software Error Recovery 

The state-transition-rate diagram of the model is depicted in Figure 3.1. Note 
that when the system is attempting recovery from a software error (state 3), 
there are no transitions representing the occurrence of a hardware fault. This is 
a consequence of our assumption that $\mu \gg \lambda$, in which case such occurrences 
are negligible. 

As for performance, let us suppose the user is interested in three 
levels of accomplishment: full performance (as would be exhibited by a fault- 
free version of the system), degraded performance (at least one software error 
during utilization but successful recovery in each case), and system failure. To 
obtain an appropriate operational model that can support this view of 
performance, we see that states 1 and 2 can be identified with one mode of 
operation while states 3 and 4 must be distinguished. Moreover, because the 
mode {1,2} is preferred over {3} and {3} is preferred over {4}, we find that the 
following function will suffice as an operational structure: 

  i    f(i)
  1    1
  2    1
  3    1/2
  4    0

To establish that the performance variable in question can indeed be 
supported by the functional Z of X (Fig. 3.1) induced by this choice of f, 
suppose that $Y_t$ is formulated as in (3.23), i.e., $Y_t$ is the minimum operational 
rate experienced during the utilization period [0,t]. Then it is easily verified 
that $Y_t$ conveys the desired information, i.e., 

  Value of $Y_t$    Interpretation
  1                 Full performance
  1/2               Degraded performance
  0                 Failure


Note also that the operational model Z is recoverable in the sense defined in 
(3.4) due to the error/recovery cycle from rate 1 to 1/2 and then back to rate 
1. The performability of the system can thus be evaluated using either of the 
methods discussed in Section 3.3. 
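
To make the role of $Y_t$ concrete, the following minimal Python sketch evaluates the minimum operational rate along a few sample paths of the base model. The paths themselves are hypothetical illustrations, chosen so that every step follows a transition allowed by the model of Figure 3.1; the function name `performance_level` is likewise an assumption of this sketch.

```python
# Operational structure f of the TMR example: state -> operational rate.
f = {1: 1.0, 2: 1.0, 3: 0.5, 4: 0.0}


def performance_level(path):
    """Y_t as the minimum operational rate over the states visited in [0, t]."""
    return min(f[state] for state in path)


# Hypothetical sample paths (sequences of visited states, starting in state 1).
print(performance_level([1, 2]))        # processor fault tolerated by voting
print(performance_level([1, 3, 1, 2]))  # software error, then successful recovery
print(performance_level([1, 3, 4]))     # software error without recovery
```

The second path shows why the model is recoverable: the operational rate returns to 1 after the visit to state 3, yet $Y_t$ records the degradation because it keeps the minimum over the whole utilization period.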

To illustrate the solution method described in (3.49), note that the 
generator matrix of X is 

\[
A =
\begin{bmatrix}
-(3\lambda+\sigma) & 3c\lambda & \sigma & 3(1-c)\lambda \\
0 & -(2\lambda+\sigma) & 0 & 2\lambda+\sigma \\
d\mu & 0 & -\mu & (1-d)\mu \\
0 & 0 & 0 & 0
\end{bmatrix}.
\]

Thus, with respect to each operational rate q = 1, 1/2 or 0, the generator 
matrices of the corresponding transformed Markov processes are given by 
(3.46), viz., 

\[
A^{1} =
\begin{bmatrix}
-(3\lambda+\sigma) & 3c\lambda & \sigma + 3(1-c)\lambda \\
0 & -(2\lambda+\sigma) & 2\lambda+\sigma \\
0 & 0 & 0
\end{bmatrix}
\]

and 

\[ A^{1/2} = A^{0} = A. \]


Hence, if we denote the transition functions of each transformed process by a 
matrix 

\[ P^q(t) = [p^q_{ij}(t)] \qquad (i, j \in \{k \mid f(k) \ge q\} \cup \{b_q\}) \]

where $p^q_{ij}(t)$ is defined by (3.47), then $P^q(t)$ is determined uniquely by $A^q$ via 
the well-known formula 

\[ P^q(t) = e^{A^q t}. \]


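In particular, we have 

\[
P^{1}(t) =
\begin{bmatrix}
e^{-(3\lambda+\sigma)t} & 3c\,[e^{-(2\lambda+\sigma)t} - e^{-(3\lambda+\sigma)t}] & c_1 \\
0 & e^{-(2\lambda+\sigma)t} & c_2 \\
0 & 0 & 1
\end{bmatrix}
\]

where 

\[
\begin{aligned}
c_1 &= 1 - e^{-(3\lambda+\sigma)t} - 3c\,[e^{-(2\lambda+\sigma)t} - e^{-(3\lambda+\sigma)t}], \\
c_2 &= 1 - e^{-(2\lambda+\sigma)t},
\end{aligned}
\]

and 

\[
P^{1/2}(t) =
\begin{bmatrix}
d_1 & d_2 & d_3 & 1-(d_1+d_2+d_3) \\
0 & e^{-(2\lambda+\sigma)t} & 0 & 1-e^{-(2\lambda+\sigma)t} \\
d_4 & d_5 & d_6 & 1-(d_4+d_5+d_6) \\
0 & 0 & 0 & 1
\end{bmatrix}
\]

where, if we let 

\[
\begin{aligned}
z &= \sqrt{9\lambda^2 + \mu^2 + \sigma^2 + 6\lambda\sigma - 6\lambda\mu - 2\sigma\mu + 4d\mu\sigma}, \\
x &= \frac{3\lambda + \mu + \sigma + z}{2}, \\
y &= \frac{3\lambda + \mu + \sigma - z}{2},
\end{aligned}
\]

then 

\[
\begin{aligned}
d_1 &= -\frac{3\lambda - \mu + \sigma - z}{2z}\, e^{-yt} + \frac{3\lambda - \mu + \sigma + z}{2z}\, e^{-xt}, \\
d_2 &= \frac{3\lambda c\,(y-\mu)}{z\,(y-2\lambda-\sigma)}\, e^{-yt} - \frac{3\lambda c\,(x-\mu)}{z\,(x-2\lambda-\sigma)}\, e^{-xt} + \frac{3\lambda c\,(\mu-2\lambda-\sigma)}{(x-2\lambda-\sigma)(y-2\lambda-\sigma)}\, e^{-(2\lambda+\sigma)t}, \\
d_3 &= \frac{\sigma}{z}\, e^{-yt} - \frac{\sigma}{z}\, e^{-xt}, \\
d_4 &= \frac{d\mu}{z}\, e^{-yt} - \frac{d\mu}{z}\, e^{-xt}, \\
d_5 &= \frac{6\lambda c d \mu}{(\lambda+\sigma-\mu+z)\,z}\, e^{-yt} - \frac{6\lambda c d \mu}{(\lambda+\sigma-\mu-z)\,z}\, e^{-xt} + \frac{12\lambda c d \mu}{(\lambda+\sigma-\mu+z)(\lambda+\sigma-\mu-z)}\, e^{-(2\lambda+\sigma)t}, \\
d_6 &= \frac{3\lambda - \mu + \sigma + z}{2z}\, e^{-yt} - \frac{3\lambda - \mu + \sigma - z}{2z}\, e^{-xt}.
\end{aligned}
\]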
Accordingly, applying (3.49), the performability of S can be expressed as 

\[
\begin{aligned}
\mathrm{perf}_S(1) &= p_1 (1 - c_1) + p_2 (1 - c_2), \\
\mathrm{perf}_S(1/2) &= p_1 (d_1 + d_2 + d_3) + p_2\, e^{-(2\lambda+\sigma)t} + p_3 (d_4 + d_5 + d_6) - \mathrm{perf}_S(1), \\
\mathrm{perf}_S(0) &= 1 - \mathrm{perf}_S(1/2) - \mathrm{perf}_S(1),
\end{aligned}
\tag{3.50}
\]

where $p_i = \Pr[X_0 = i]$. 

By expressing the performability of the system in terms of the above 
closed form solution (3.50), various design tradeoffs can then be investigated 
by varying the parameter values. To illustrate, let us fix the following base 
model parameters: 

\[ \lambda = 5 \times 10^{-4}, \qquad c = .99999, \qquad \mu = 10^{3}, \qquad d = .9 \]

and assume that the system initially has all three modules operational, i.e., 

\[ p = [1\ \ 0\ \ 0\ \ 0]. \]


Then, depending on the choice of the software failure rate $\sigma$, the 
performability of the system exhibits various kinds of relationships between 
different levels of accomplishment. In particular, Figures 3.2 and 3.3 display the 
performability of S as a function of t (the duration of the utilization) for 
$\sigma = 10^{-2}$ and $\sigma = 10^{-3}$, respectively. 

In both figures, the performability of S is represented by three 
curves: I, II and III. Curve I is the probability of fault-free operation 
throughout the utilization. This probability decreases from 1 to 0 as t goes to 
infinity. Curve II is the probability that the system suffered from performance 
degradation due to software faults while remaining operational throughout the 
utilization period. Finally, we note that curve III is the familiar S-shaped 
function for system unreliability. 

When we compare Figure 3.2 with Figure 3.3, we find that 
substantial performance improvement is obtained by reducing the number of 
design faults in the software. For example, consider the case when the 
duration of the utilization is 400 hours. With the reduction in software failure 
rate by a factor of 10, the probability of degraded performance and the 
probability of system failure are reduced, respectively, to 1/2 and 1/3 of their 
original values; more significantly, the probability of full performance is 
increased by a factor of 37. 
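
The comparison at 400 hours can be checked numerically with the solution method of (3.49). The sketch below is an illustration only, not the procedure used to produce the figures: it evaluates $P^q(t) = e^{A^q t}$ with a simple scaling-and-squaring matrix exponential written for this example, and the function names (`matmul`, `expm`, `generator`, `performability`) are assumptions of the sketch. The parameter values are those fixed above, with the system starting in state 1.

```python
import math


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(a, t):
    """e^{a t} by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a) * t
    s = math.ceil(math.log2(norm / 0.5)) if norm > 0.5 else 0
    h = t / 2 ** s                                   # scaled time step
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, 20):                           # Taylor terms (a h)^k / k!
        term = [[x * h / k for x in row] for row in matmul(term, a)]
        result = [[r + x for r, x in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(s):                               # undo the scaling
        result = matmul(result, result)
    return result


def generator(lam, c, sigma, mu, d):
    """Generator matrix A of the 4-state TMR base model (states 1..4)."""
    return [[-(3 * lam + sigma), 3 * c * lam, sigma, 3 * (1 - c) * lam],
            [0.0, -(2 * lam + sigma), 0.0, 2 * lam + sigma],
            [d * mu, 0.0, -mu, (1 - d) * mu],
            [0.0, 0.0, 0.0, 0.0]]


def performability(sigma, t, lam=5e-4, c=0.99999, mu=1e3, d=0.9):
    """Return (perf(1), perf(1/2), perf(0)) for initial state 1."""
    A = generator(lam, c, sigma, mu, d)
    # A^1: states 3 and 4 merged into one absorbing state, as in (3.46).
    A1 = [[A[0][0], A[0][1], A[0][2] + A[0][3]],
          [A[1][0], A[1][1], A[1][2] + A[1][3]],
          [0.0, 0.0, 0.0]]
    P1 = expm(A1, t)                         # P^1(t)
    Ph = expm(A, t)                          # P^{1/2}(t), since A^{1/2} = A
    full = 1.0 - P1[0][2]                    # never left states {1, 2}
    degraded = (1.0 - Ph[0][3]) - full       # still operational, visited state 3
    failed = Ph[0][3]
    return full, degraded, failed


for sigma in (1e-2, 1e-3):
    print(sigma, performability(sigma, t=400.0))
```

Under these assumptions the ratio of the two full-performance probabilities should come out close to the factor of 37 quoted above, and the degraded and failure probabilities should shrink to roughly 1/2 and 1/3 of their values for $\sigma = 10^{-2}$.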

Although the above example has established the feasibility of the 
hierarchical modeling approach for the evaluation of system performabilities, 
the operational model constructed in the example is based on a specific state 
transition diagram (Figure 3.1). Thus, to extend the generality of the model, 
we consider in the next chapter a systematic approach for describing the 
underlying Markov process together with a general method for determining the 
operational structures of degradable computing systems. Moreover, we also 
note that the results developed in this chapter are further extended in Chapter 
5 to permit the modeling and evaluation of systems with a time-varying 
environment. 



Figure 3.2: Performability of S, $\mathrm{perf}_S(q) = \Pr[Y_t = q]$, versus t (hours) 

Figure 3.3: Performability of S, $\mathrm{perf}_S(q) = \Pr[Y_t = q]$, versus t (hours) 



CHAPTER 4 


MODELING AND EVALUATION OF DEGRADABLE 
MULTIPROCESSOR SYSTEMS 


4.1 Introduction 

The design of a distributed multiprocessor system (see [28]-[30], for 
example) is generally approached in a sequential manner beginning with the 
identification of the computing system’s application. The problem identification 
phase is followed by a functional breakdown of the application into major 
subtasks to be performed by the system. Following these phases, the designer 
then specifies the performance and reliability requirements in terms of the 
resource requirements for each task, the time relationship between tasks, the 
executive software overhead for system control, etc. Finally, based on the 
performance and reliability requirements of the system, alternative hardware 
and software architectures are then considered to optimize cost, performance 
and other trade-off criteria. 

Moreover, in the design of multiprocessor systems for real-time 
control applications, tasks to be performed are often partitioned into several 
priority groups and priority interrupt mechanisms are used to meet the stringent 
constraints of fast response time (see [29] and [30], for example). Normally, 
all tasks are executed iteratively to generate sample-time updates of control 
variables. However, when the computer’s resources decrease due to faults to a 
point that only some of the tasks can be completed in time, tasks from the 
higher priority groups are given preferential treatment over tasks from the 
lower priority groups. Computing systems capable of performance degradation 
as above are often referred to as gracefully degradable computing systems or 
simply degradable computing systems. 

Performance degradation of degradable multiprocessor systems is 
typically realized through the following steps [31]: 

1) Error detection: Basic techniques for error detection include error 
detecting/correcting codes, time-out counter, memory protection, majority 
voting, periodic testing, etc. 

2) Fault location and hardware reconfiguration: Once an error is 
detected, diagnostic programs and testing strategies are used to localize the 
faulty components. The hardware reconfiguration program is then called upon 
to establish an operational configuration. 

3) Computation recovery: Computation recovery concerns the 
restoration of a valid system state from which the system can resume its 
operation. The restoration of a valid system state can be achieved by rollback 
and retry, uses of traces, program roll-aheads, etc. 

4) Software reconfiguration: Fault-tolerant software permits the 
replacement of suspected software modules with their alternative versions at 
run time. Current approaches include N-version programming [26] and the use 
of recovery blocks [32]. 

There are many different designs of degradable multiprocessor 
systems that are potential candidates for real-time control applications. In 
general, these designs can be characterized in terms of the degree of 
redundancies built into their basic components (e.g., simplex, duplex and 
triple-modular-redundancy). The characterization is useful because each class 
of systems can be attributed with certain specific performance and reliability 
trade-offs. 

Replicated components are often used in degradable 
multiprocessor systems to enhance the system reliability. For example, when 
triple-modular-redundancy (TMR) is used, not only can all single faults be 
tolerated, but procedures for error detection, fault location and system 
reconfiguration are also simplified considerably. However, since there is no 
parallel processing within TMR, the configuration represents a substantial loss 
of computing power. Systems using triplicated components in their design 
include C.vmp [33], SIFT [30] and FTMP [25]. 

When the application of a degradable multiprocessor system requires 
not so much reliability (i.e., uninterrupted operation throughout the utilization 
period) as the ability to recover from failures, simplex components are often 
used to improve the performance/cost ratio of the system (e.g., PRIME [34] 
and PLURIBUS [29]). Since such systems rely mostly on complicated 
self-testing logic for error detection and location, the reliability of these systems is 
generally lower than that of systems using replicated components. This is 
because single faults may cause such systems to crash and the detection latency 
[35] (the time period between the first error and the first detected error) of 
these systems is generally longer. 

In this chapter, a comprehensive model for degradable 
multiprocessor systems is presented for studying the trade-offs between systems 
with different degrees of component redundancy. The model is based on the 
approach considered in the previous chapter; a Markovian base model is 
described to represent the physical resources of the system and priority queuing 
models are used to determine the operational structure associated with the base 
model. Since our model supports the evaluation of system performability, it 
differs substantially from those considered by Borgerson [36] and Losq [37] 
which stress hardware-oriented measures such as reliability or availability. 

4.2 System Model 

As compared with existing Markov models for degradable 
multiprocessor systems (see [36] and [37], for example), the model presented 
here has the advantages that (i) the partitioning of the system is based on the 
system's available resources as well as the computational requirements of the user's 
application, and (ii) the hierarchical representation (i.e., the base model together 
with the operational structure) permits the formulation and evaluation of 
user-oriented performance variables. 

4.2.1 Base Model for System Resources 

Degradable multiprocessor systems typically can be divided into 
several physical resources each of which is made up of one or more identical 
components. These resources generally form a pool shared by tasks to be 
performed by the system. For example, as described in [25], the FTMP 
computer includes 15 processors, 9 memory units, 5 busses and 48 bus 
guardian units to be shared by aircraft functional tasks. Generally, the amount 
of resources that a system can provide varies from time to time depending on 
the intrinsic hardware failure rates, the effectiveness of fault tolerance 
mechanisms and the repair procedures. Hence, if we assume that (i) the 
occurrences of failures and repairs are independent among different 
components, (ii) each component has constant failure and repair rates, and (iii) 
fault tolerance mechanisms are "memoryless" in the sense that they are 
determined by the current state of the system, then the resource availability of 
the system can be represented as a Markov process. 

Before attempting to describe a general Markov model for system 
resources, let us consider first the effects of failures on the system resources. 
As suggested in [37], two classes of hardware faults can be distinguished 
according to the characteristics of fault tolerance mechanisms. The first class 
corresponds to hardware faults that are detected and recovered from as soon as 
they occur. Hardware faults of the first class are referred to as safe faults 
because failed components are removed instantaneously from the resource pool 
to avoid data contamination or faulty control. The second class of hardware 
faults, referred to as unsafe faults, contains those that are either latent faults or 
those that are in the process of fault recovery. Depending on the degree of 
component redundancy, failure to tolerate unsafe faults may cause the system 
to crash because of the loss or the corruption of important information. 

Since the performance of a system is determined by the amount of 
fault-free resources, the state of the resource model can be defined to be the 
number of safe faults and unsafe faults. More precisely, suppose that the 
multiprocessor system contains N resources, where the i-th resource contains $n_i$ 
identical components. Suppose further that each component of the i-th resource has 
failure rate $\lambda_i$ and repair rate $\mu_i$. If we assume that the occurrence of more 
than one event such as failure or repair completion has negligible probability, 
then the system can be represented as a Markov process 

\[ X = \{ X_t \mid t \in T \} \tag{4.1} \]

where, for each $t \in T$, $X_t$ is a random variable taking values in the state set 

\[ Q = \{ (a_1,b_1,a_2,b_2,\ldots,a_N,b_N) \mid 0 \le a_i + b_i \le n_i \text{ for all } 1 \le i \le N \}. \]

For each state $(a_1,b_1,a_2,b_2,\ldots,a_N,b_N)$ in Q, $a_i$ denotes the number of safe faults 
of the i-th resource and $b_i$ denotes the number of unsafe faults of the i-th 
resource. 

Four types of state transitions can be distinguished: repair, safe 
fault, unsafe fault and fault recovery (see Table 4.1). On the completion of a 
repair, the number of safe faults is decreased by one. Once a hardware fault 
has occurred, either the number of the safe faults or the number of unsafe 
faults is increased by one depending on whether the fault is safe or unsafe. 
Finally, the successful recovery of an unsafe fault will decrease the number of 
unsafe faults by one and increase the number of safe faults by one. 
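
The state set and the four transition types can be enumerated mechanically. The following Python sketch is a minimal illustration under stated assumptions: the function names `state_set` and `successors` are hypothetical, a state is stored as the tuple $(a_1, b_1, \ldots, a_N, b_N)$ of safe and unsafe fault counts, and the transition rates are deliberately omitted because they depend on the parameters of Table 4.1.

```python
from itertools import product


def state_set(n):
    """All states (a_1, b_1, ..., a_N, b_N) with 0 <= a_i + b_i <= n_i."""
    per_resource = [[(a, b) for a in range(ni + 1) for b in range(ni + 1 - a)]
                    for ni in n]
    return [tuple(x for pair in combo for x in pair)
            for combo in product(*per_resource)]


def successors(state, n):
    """(transition type, next state) pairs leaving `state`; rates not modeled."""
    out = []
    for i, ni in enumerate(n):
        a, b = state[2 * i], state[2 * i + 1]   # safe and unsafe fault counts

        def moved(da, db, i=i):
            s = list(state)
            s[2 * i] += da
            s[2 * i + 1] += db
            return tuple(s)

        if a > 0:                               # repair removes one safe fault
            out.append(("repair", moved(-1, 0)))
        if a + b < ni:                          # a fault-free component remains
            out.append(("safe fault", moved(1, 0)))
            out.append(("unsafe fault", moved(0, 1)))
        if b > 0:                               # unsafe fault becomes safe
            out.append(("fault recovery", moved(1, -1)))
    return out


n = (2, 1)                       # hypothetical system: two resources
states = state_set(n)
print(len(states))
print(successors((0, 1, 0, 0), n))
```

For a single resource with n components the enumeration yields (n+1)(n+2)/2 states, which gives a quick sense of how the base model grows with the amount of redundancy.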

To derive the transition rates of the above state transitions, the 
system's ability to recover from a hardware fault is modeled by a single 
parameter c defined to be the conditional probability that a system will be able 
to recover once a hardware fault has occurred (referred to as the coverage of 
the system, see Section 3.6). It is further assumed that the probability of a 
failure being transient is represented by another parameter a. Based on the 
above assumptions about the failure and repair characteristics of the system, 
the transition rates of the above Markov process can be expressed as in Table 
4.1 (where the parameters are summarized in Table 4.2). 

When the system contains only one type of resource (i.e., N = 1), 
the model can be represented more conveniently using transition graphs as in 
Figure 4.1. This single resource model, as a special case of the above model, is 
the same as the one considered in [37] by Losq except that states are named 
differently here to simplify the formulation of operational rates. 

Another special case of the above general model can be obtained 
when all unsafe faults result in system failure. In this case, since unsafe faults 
cause system resources to vanish, the state of the system can be taken to be the 
number of fault-free components at each moment in time. In other words, the 
state set Q defined in (4.1) can be represented as 

\[ Q = \{ (a_1,a_2,\ldots,a_N) \mid 0 \le a_i \le n_i \text{ for all } 1 \le i \le N \} \tag{4.2} \]

where, for each $(a_1,a_2,\ldots,a_N)$ in Q, $a_i$ is the number of fault-free 
components of the i-th resource. The simplification of the corresponding 
transition rates is described in Table 4.3. Again, as a special case, the single 
resource model can be represented by a transition graph as in Figure 4.2. 

Table 4.2 

  Parameter      Interpretation
  $\lambda_i$    Component failure rate of the i-th resource
  $\mu_i$        Component repair rate of the i-th resource
  $\nu_i$        Component recovery rate of the i-th resource following unsafe faults
  a              Probability of a fault being transient
  c              Probability of a fault being safe

Figure 4.2 

4.2.2 Performance Variable 

The performance and reliability requirements of a real-time system 
often differ from those of a general purpose computer because of the stringent 
constraints of fast response time. Thus, to characterize the performance and 
the reliability of a real-time system by a single performance variable, the 
performance variable must take into account both the resource availability and 
the promptness of the system response. 

As an example, let us consider first a typical real-time environment 
encountered by a control computer. In a study concerning the design of 
fault-tolerant computers for aircraft, Ratner et al. [39] have identified 26 
computational tasks most likely to be performed by the control computer of an 
advanced commercial aircraft. These computation tasks conceptually can be 
regarded as short programs stored in a common memory of a multiprocessor 
system and each task is scheduled to be executed periodically according to a 
predetermined frequency. However, the actual execution of a task may be 
delayed from its scheduled execution time because of the resource sharing, 
interface between tasks and the overhead for running system software. Since a 
prolonged starting-time delay may cause dangerous conditions to develop, the 
computational tasks are grouped into 5 priority classes and priority interrupt 
mechanisms are used to reduce the delay times of more critical tasks. 


The above example clearly shows that the concept of the starting-time 
delay is a useful tool for specifying the performance criteria of a real-time 
control application: A task is regarded as failing to satisfy the real-time 
constraint if its starting-time delay, on a regular basis, exceeds a certain 
predetermined value. More precisely, let d be the estimated length of time that 
would have to elapse before an undesirable condition is noticed. Then the 
real-time constraints can typically be stated as: either the average starting-time 
delay or the percentile starting-time delay must be less than d. For example, 
a percentile constraint may require that the starting-time delay 
not exceed d on 99% of the occasions on which a task is scheduled to be 
executed. For simplicity, it will be assumed in the following discussions that 
the average starting-time delay is used to specify the real-time constraints. 
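
The two forms of the real-time constraint are not equivalent, as a small Python sketch shows. The function names and the sample delays below are hypothetical and serve only to illustrate the distinction; the 99% level is the one used in the example above.

```python
import statistics


def meets_average(delays, d):
    """Average form: the mean starting-time delay must be less than d."""
    return statistics.mean(delays) < d


def meets_percentile(delays, d, fraction=0.99):
    """Percentile form: the delay must not exceed d in at least
    `fraction` of the scheduled executions."""
    within = sum(1 for x in delays if x <= d)
    return within >= fraction * len(delays)


# Hypothetical observations: 95 prompt starts and 5 late ones (same time unit as d).
delays = [1.0] * 95 + [10.0] * 5
print(meets_average(delays, d=2.0))      # the mean delay is 1.45
print(meets_percentile(delays, d=2.0))   # only 95% of the starts are within d
```

In this illustration the average constraint is met while the percentile constraint is violated, which is why the choice between the two has to be fixed before the operational structure is derived.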

To generalize the above notion of the starting-time delay, a 
user-oriented performance variable can be formulated as follows. First, let us 
assume that the tasks to be performed by the computer belong to one of a set 
of $\kappa$ different priority classes. Then, relative to the utilization period T = [0,t], 
the performance variable can be defined as a random variable Y taking values in 
the set $\{0,1,\ldots,\kappa\}$ such that 

Y = k iff k is the largest nonnegative integer less than or equal to $\kappa$ 
such that all tasks from the first k priority groups are executed within 
the real-time constraints throughout T. 

In other words, if we adopt the convention that the priority groups are 
numbered in reverse order (i.e., the smaller the number, the higher the 
priority), then the performance variable Y can be regarded as the degree of 
user satisfaction relative to how well the more critical tasks are executed by the 
computer to satisfy the real-time constraints. In particular, Y = 0 can be 
interpreted as a system crash, since, in this case, the computer does not even 
have enough resources to meet the demand of the most critical tasks (i.e., 
those from priority group 1). On the other hand, $Y = \kappa$ represents nondegraded 
performance, since all tasks are executed properly within their real-time 
constraints. 

4.2.3 Operational Structure 

To ease the evaluation of system performability, the connection 
between the performance variable Y and the base model X can be established 
by introducing intermediate models to account for the internal 
structure of the computer. Under the assumptions described in Sections 4.2.1 
and 4.2.2, a natural representation of the system's behavior at the operational 
level is the behavior of the computer's control programs. Typically, to resolve 
conflicting demands on system resources, the control software of a 
multiprocessor system must provide a scheduler (also called a supervisor or 
executive) that allocates system resources to application tasks and handles 
interfaces between tasks. When the computer is degradable, the control 
software must also provide a mechanism (called a reconfiguration mechanism) 
which, upon detection of faults, appropriately changes the scheduler to facilitate 
error recoveries. In other words, given a general resource model as described 
in Figure 4.1, it is possible to associate each state of the model with a 
scheduler. Accordingly, depending on how well the application tasks are 
performed within each resource state according to a given scheduler, various 
operational rates can be identified to reflect different degrees of user 
satisfaction. 

To measure the effectiveness of the scheduling algorithms associated 
with the resource states, it is assumed that each scheduler is modeled by a 
resource sharing priority queuing model (see [21], for example). For each 
resource state of a resource model, it is further assumed that arriving tasks 
form a single queue according to the "head-of-the-line" (HOL) queuing 
discipline (see Figure 4.3), that is, an arrival from priority class k joins the 
queue behind all "customers" from priority class k (and higher) and in front of 
all "customers" from priority class k+1 (and lower). Moreover, the value of 
one's priority remains constant in time. Thus, while the customers with the 
highest priorities are selected for service ahead of those with lower priorities, 
customers from the same priority class are served on a first-come, first-served 
(FCFS) basis. Finally, we also note that two possible refinements in priority 
mechanism can be distinguished depending on whether the execution of a 
low-priority task is interrupted when a task of higher priority arrives. 
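
The HOL discipline amounts to ordering the queue first by priority class and then by arrival order. A minimal Python sketch, assuming a hypothetical class `HOLQueue` built on a binary heap and covering only the selection of the next task (not the interruption of a task already in service), is:

```python
import heapq
from itertools import count


class HOLQueue:
    """Head-of-the-line queue: a smaller class number means higher priority;
    tasks of the same class are served first-come, first-served."""

    def __init__(self):
        self._heap = []
        self._arrivals = count()     # arrival sequence number breaks ties

    def arrive(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._arrivals), task))

    def next_task(self):
        priority, _, task = heapq.heappop(self._heap)
        return priority, task

    def __len__(self):
        return len(self._heap)


queue = HOLQueue()
for priority, task in [(2, "a"), (1, "b"), (2, "c"), (1, "d")]:
    queue.arrive(priority, task)

order = [queue.next_task()[1] for _ in range(len(queue))]
print(order)     # class 1 tasks first, each class in arrival order
```

Storing the arrival sequence number alongside the priority is what keeps the ordering within a class first-come, first-served; a heap keyed on priority alone would not guarantee it.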

Since we are concerned with the ability of a system to satisfy the 
real-time constraints, the effectiveness of each scheduler can be measured in 
terms of the "expected waiting times" of the corresponding queuing model. 
More precisely, for each state q of a general resource model and for each 
priority class k, let $\tau^q_k$ be a random variable denoting the time spent waiting in 
the queue by a priority k "customer" with respect to the queuing model of 
resource state q (see [40], p. 189). Then the expected value of $\tau^q_k$ is a close 
approximation of the average starting-time delay when (i) the communication 
delays between tasks are negligible, and (ii) the transition rates of the given 
resource model are much lower than the arrival and the service rates of the 
computational tasks. The first condition can typically be satisfied by treating 
each task as an atomic unit for resource allocation. The second condition is 
usually satisfied automatically because failures and repairs of a computer occur 
much less frequently than the iteration rates of the computational tasks (thus 
the expected waiting times rapidly approach their steady-state values once the 
system enters a particular state). 

The above relationships between the average starting-time delays 
and the expected waiting times provide us with a basis for partitioning the 
resource states into operational modes. First, we note that the ability of a 
resource state q to satisfy the real-time constraints of a priority k task can be 
expressed as 

\[ E[\tau^q_k] < d \tag{4.4} \]

where d is the predetermined time length as described in the previous 
subsection. The use of average starting-time delays to specify the real-time 
constraints is reasonable because, in general, the effect of a starting-time delay 
on the behavior of the system is proportional to the duration of the delay. 

In addition to the real-time constraints, the partitioning of system 
states into operational modes must also take into account the effects of unsafe 
faults. For example, consider a multiprocessor system with triplicated 
components. The occurrence of an undetected double fault in a triad will cause 
the system to fail regardless of the duration of starting-time delays. 
Accordingly, if we assume that the probabilistic nature of the system resources 
is specified as a Markov process X with state space Q (see (4.1)), then an 
operational structure can be given as a function 

\[ f: Q \to \{0,1,\ldots,\kappa\} \]

where, for each $q \in Q$, 

\[
f(q) =
\begin{cases}
k & \text{if unsafe faults in } q \text{ are tolerated and, for some } 1 \le k \le \kappa,\ E[\tau^q_k] < d \text{ and } E[\tau^q_{k+1}] \ge d, \\
0 & \text{otherwise.}
\end{cases}
\tag{4.5}
\]

Note that, in the above formulation, we have assumed that tasks of higher 
priorities have shorter expected waiting times, i.e., 

\[ k < k' \ \text{implies}\ E[\tau^q_k] < E[\tau^q_{k'}]. \]

The assumption is satisfied when the scheduling algorithm (associated with q) 
uses the usual priority queuing disciplines in resource allocation. Having 
established the operational structure, the performance variable described in 
Section 4.2.2 can now be expressed as 

\[ Y = \min\{ f(X_t) \mid t \in T \}. \]
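
The operational structure (4.5) and the resulting performance variable can be illustrated with a short Python sketch. The function names, the expected waiting times and the visited states below are hypothetical; the sketch also assumes, as in the text, that expected waiting times increase with the priority class number, and it treats the case in which every class meets the constraint as the highest operational rate.

```python
def operational_rate(expected_waits, d, unsafe_tolerated=True):
    """f(q) in the spirit of (4.5): the number of leading priority classes
    whose expected waiting time is below d, or 0 if unsafe faults in q
    are not tolerated."""
    if not unsafe_tolerated:
        return 0
    k = 0
    for wait in expected_waits:      # ordered from priority class 1 upward
        if wait >= d:
            break
        k += 1
    return k


def performance(rates_along_path):
    """Y = min{ f(X_t) | t in T } over the resource states visited."""
    return min(rates_along_path)


# Hypothetical expected waiting times (three priority classes) per resource state.
waits_by_state = {
    "all processors up": [0.1, 0.2, 0.3],
    "one processor down": [0.1, 0.3, 0.9],
    "two processors down": [0.7, 2.0, 9.0],
}
d = 0.5
f = {q: operational_rate(w, d) for q, w in waits_by_state.items()}
print(f)
print(performance([f["all processors up"], f["one processor down"]]))
```

A state with an untolerated unsafe fault maps to 0 regardless of its waiting times, which is how the effect of unsafe faults enters the partitioning alongside the real-time constraints.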

Finally, to conclude the construction of the operational model, we 
note that the analysis of a priority queuing system is generally more difficult 
than that of a nonpriority system. In particular, for the multiple channel 
(multiple server) case, it is usually required to assume no service-time 
distinctions between priorities or else the mathematics becomes intractable. 
However, as illustrated in the following section, even under the above stringent 
conditions, the scheduling algorithms of a single resource model can still be 
modeled satisfactorily using a priority M/M/m queue (see [40], pp. 193-194). 
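
For orientation, the expected waiting times of a priority M/M/m queue can be computed as sketched below. The sketch assumes the nonpreemptive HOL discipline, Poisson arrivals for each class and a common exponential service rate for all classes, and it uses the standard textbook result for that case; the exact expressions of [40] are not reproduced here, and the function names and numerical rates are hypothetical.

```python
import math


def erlang_c(m, a):
    """Probability that an arrival finds all m servers busy (M/M/m queue);
    a is the offered load (total arrival rate / service rate), a < m."""
    tail = a ** m / math.factorial(m) * m / (m - a)
    head = sum(a ** n / math.factorial(n) for n in range(m))
    return tail / (head + tail)


def priority_waits(arrival_rates, mu, m):
    """Expected time in queue for each priority class, class 1 first,
    assuming nonpreemptive priorities and a common service rate mu."""
    total = sum(arrival_rates)
    if total >= m * mu:
        raise ValueError("queue is unstable: total arrival rate >= m * mu")
    w0 = erlang_c(m, total / mu) / (m * mu)    # mean delay until a server frees
    waits, sigma_prev = [], 0.0
    for lam in arrival_rates:
        sigma = sigma_prev + lam / (m * mu)    # load of this class and higher ones
        waits.append(w0 / ((1.0 - sigma_prev) * (1.0 - sigma)))
        sigma_prev = sigma
    return waits


# Hypothetical workload: three priority classes sharing m = 4 processors.
print(priority_waits([1.0, 1.0, 1.0], mu=1.0, m=4))
```

Repeating the computation with fewer servers, as would be the case after processor failures, gives the expected waiting times needed to evaluate the constraint (4.4) in each resource state.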

For multiple resource models, the scheduler associated with each 
resource state can be modeled by an open or a closed queuing network (also 
called a network of queues; see [4] and [21], for example). In addition to the 
basic assumptions governing the arrival and service rates of each "service 
station" (see [4], pp. 161-163), it is also required to assume that the priority 
queuing discipline at each service station is work conserving in the sense that 
the priority interrupts will not impose extra work on the server. Solutions for 
the expected waiting times at each service station can be obtained by 
incorporating solutions for priority M/M/m queues in the multiple resource 
models. 

4.3 Evaluation of Two Degradable Multiprocessor Systems

In this section, the performances of two degradable multiprocessor 
systems are evaluated and compared to determine the tradeoffs between two 
design approaches. Both of the computers to be evaluated are assumed to be 
multiprocessor systems containing 4 identical processors and a common 
memory module. They differ in the way that the system resources are allocated 
to perform computations. One computer ($S_1$) is assumed to operate in simplex
mode, i.e., the processors are allowed to operate independently. The other
computer ($S_2$) is assumed to operate in duplex mode, i.e., the processors are
paired into duplex subsystems to enhance the system reliability. While the
simplex mode of operation generally can improve the performance/cost ratio of 
the system, the duplex mode of operation provides better detection and fault 
coverage relative to the simplex mode of operation. 

With respect to hardware faults, we assume that the processor
modules fail independently and permanently with a constant failure rate $\lambda_2$
(failures per hour). The memory module is assumed to have a constant failure
rate $\lambda_1$ and to fail independently of the other subsystems. We further
assume that the system’s ability to recover from a failure is accounted for by
the coverage factors $c_1$ and $c_2$, respectively, for $S_1$ and $S_2$. Since simplex
subsystems rely on error detecting codes and self-checking logic for error
detection and location, the coverage of $S_1$ is generally lower than that of $S_2$.
This is because, in the simplex mode of operation, undetected single faults may
cause the system to crash and the detection latency (i.e., the time duration
between the first error and the first detected error) of a simplex subsystem is
generally longer than that of a replicated subsystem. Accordingly, it will be
assumed that

\[ c_1 < c_2 . \tag{4.6} \]


Under the above assumptions, both $S_1$ and $S_2$ can be represented as
a Markov process according to the general resource model as described in (4.1).
However, since neither the simplex mode of operation nor the duplex mode of
operation can tolerate unsafe faults, the model can be reduced to the special
case whose state space is given by (4.2). In particular, each of $S_1$ and $S_2$ can be
conveniently represented by a 10-state Markovian base model $X_i$ ($i = 1$ or 2),
where the state space $Q_i$ ($i = 1$ or 2) can be represented as

\[ Q_i = \{ (k,j) \mid k = 0, 1 \text{ and } j = 0, 1, 2, 3, 4 \} . \tag{4.7} \]

For each state $(k,j)$ in $Q_i$, $k$ denotes the operational status of the shared
memory module (0 = failed and 1 = working) and $j$ denotes the number of
working processor units. Thus, for instance, a fault-free configuration is 
encoded as state (1,4). The state-transition diagram of the model is depicted in 
Figure 4.4. 
Figure 4.4
State-Transition Diagram of the Base Model for $S_1$ and $S_2$

To provide a useful comparison between $S_1$ and $S_2$, their
performance must be evaluated with respect to the same work environment. In 
this regard, let us consider a typical real-time control application where its 
control computer uses priority interrupt mechanisms to meet the stringent 
constraints of fast response time. For simplicity, let us assume that each 
computational task of the system is assigned a priority of 1 or 2 denoting, 
respectively, high or low-priority. Normally, all tasks are executed iteratively to 
generate sample-time updates of control variables. However, when the 
computer’s resources decrease, because of failures, to a point that only some of 
the tasks can be completed in time, tasks from the high-priority group are given
preferential treatment over tasks from the low-priority group. Based on the above
assumptions about the work environment, a user-oriented performance variable
can then be formulated as

\[ Y_i = \begin{cases} 2 & \text{if the system is operating as prescribed,} \\ 1 & \text{if only high-priority tasks meet the average response time requirement,} \\ 0 & \text{if high-priority tasks cannot meet the average response time requirement.} \end{cases} \tag{4.8} \]


In other words, the performance of $S_i$ conveys the following
information:

    Value of Y_i     Interpretation
    2                normal
    1                degraded
    0                failure

On closer examination of the relationship between $Y_i$ and $X_i$, we
find it is necessary to introduce an intermediate model between them because
$X_i$ is not detailed enough to support the user’s view of system performance $Y_i$.
One way to introduce such an intermediate model is to identify an operational
model by taking into account the workload environment of the computer. In
this regard, let us assume that tasks from priority group $i$ ($i = 1$ or 2) arrive in a
Poisson stream at 1 task per millisecond. We also assume that each task,
regardless of its priority, requires a service time exponentially distributed with
mean service time 1/3 milliseconds. Furthermore, we assume that the head-of-the-line
(HOL) queuing discipline is used, but there is no preemption [5]. Then, if the behavior
of the system in a state $q$ is modeled as a queuing system with $m$ servers, the
average starting-time delay of tasks from priority group $i$ can be approximated
by the expected waiting time of the M/M/m priority queue.

As for the operational structure, the states of the base model $X_i$ can
now be partitioned into three operational modes according to their average
starting-time delays. In particular, suppose that the average starting-time delays
of both priority groups shall not exceed $d$ milliseconds. Then, the operational
structure of $S_i$ ($i = 1$ or 2) can be expressed as a function

\[ f_i : Q_i \to R \]

such that, for each $q \in Q_i$,

\[ f_i(q) = \begin{cases} 2 & \text{if } E[\tau_q^{1}] \le d \text{ and } E[\tau_q^{2}] \le d, \\ 1 & \text{if } E[\tau_q^{1}] \le d \text{ and } E[\tau_q^{2}] > d, \\ 0 & \text{otherwise.} \end{cases} \tag{4.9} \]

If $d = 1/2$ millisecond, the operational structures of $S_1$ and $S_2$ can be tabulated
as follows:


    State q of S_i      Operational rate     Operational rate
    (i = 1 or 2)        f_1(q) of S_1        f_2(q) of S_2

    (1,4)               2                    2
    (1,3)               2                    2
    (1,2)               2                    1
    (1,1)               1                    1
    (1,0)               0                    0
    (0,4)               0                    0
    (0,3)               0                    0
    (0,2)               0                    0
    (0,1)               0                    0
    (0,0)               0                    0

The operational rate assignments in the table are determined by the
degree of subsystem redundancy. For instance, suppose that the system is in
state (1,2), i.e., the memory module is operational and two processors are
working. Then, since in the duplex mode of operation both processors must
perform the same function in parallel, the behavior of $S_2$ in state (1,2) can be
modeled as an M/M/1 priority queue. Thus, using the existing formula for
computing the expected queueing times (see [40], for example), the average
starting-time delays can be approximated by

\[ \tau_q^{1} = 1/3 \text{ milliseconds and } \tau_q^{2} = 1 \text{ millisecond} \]

where $q = (1,2)$. Since, in this case,

\[ \tau_q^{1} \le d \text{ and } \tau_q^{2} > d, \]

we have $f_2(q) = 1$ by equation (4.9). On the other hand, since the behavior of
$S_1$ in state (1,2) can be modeled as an M/M/2 priority queue, the average
starting-time delays can be estimated as

\[ \tau_q^{1} = 1/30 \text{ milliseconds and } \tau_q^{2} = 1/20 \text{ milliseconds} \]

where $q = (1,2)$. Accordingly, it follows by (4.9) that $f_1(q) = 2$.
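
The four delays quoted above can be checked numerically. The sketch below is not taken from [40]; it assumes the standard expected-wait formula for a nonpreemptive head-of-the-line M/M/m queue in which all priority classes share one service rate, and the function names are illustrative only. Exact fractions are used so that the results can be compared with the values in the text.

```python
from fractions import Fraction
from math import factorial


def erlang_c(m, a):
    """Probability that an arriving task must wait in an M/M/m queue.

    a is the offered load (total arrival rate divided by the service rate).
    """
    rho = a / m
    tail = a**m / factorial(m) / (1 - rho)
    head = sum(a**k / factorial(k) for k in range(m))
    return tail / (head + tail)


def priority_waits(arrival_rates, mu, m):
    """Expected waits per priority class (highest priority first).

    Assumes m >= 1 servers, nonpreemptive HOL discipline, and a common
    exponential service rate mu for every class.
    """
    total = sum(arrival_rates)
    if total >= m * mu:
        raise ValueError("queue is unstable")
    w0 = erlang_c(m, total / mu) / (m * mu)
    waits, sigma_prev = [], 0
    for lam in arrival_rates:
        sigma = sigma_prev + lam / (m * mu)  # cumulative utilization
        waits.append(w0 / ((1 - sigma_prev) * (1 - sigma)))
        sigma_prev = sigma
    return waits


def operational_rate(waits, d):
    """Equation (4.9) for two priority groups and delay bound d."""
    if waits[0] <= d and waits[1] <= d:
        return 2
    if waits[0] <= d:
        return 1
    return 0


if __name__ == "__main__":
    rates, mu, d = [Fraction(1), Fraction(1)], Fraction(3), Fraction(1, 2)
    for m in (1, 2):
        w = priority_waits(rates, mu, m)
        print(m, w, operational_rate(w, d))
```

With one server (the duplex pair of $S_2$ in state (1,2)) this yields delays 1/3 and 1 and rate 1; with two servers ($S_1$ in state (1,2)) it yields 1/30 and 1/20 and rate 2, in agreement with the text.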

Given the above performance variables $Y_i$ and the operational
structures $f_i$, it can easily be verified that, for each $X_i$ ($i = 1$ or 2),

\[ Y_i = \min\{ f_i(X_t^{i}) \mid t \in T \} \]

where $T$ is the utilization period and $X_t^{i}$ is the state of $X_i$ at time $t$. Thus the
solution methods described in the previous chapter can now be used to solve the
system performability (i.e., the probability distribution function of $Y_i$).
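
As a small illustration of this relation, the sketch below evaluates the minimum of the operational rate along a given sequence of visited states. The rate tables are copied from the tabulation above; the trajectories and the helper names are hypothetical.

```python
# Operational rates for states (1, j); every state with failed memory has rate 0.
F1 = {(1, 4): 2, (1, 3): 2, (1, 2): 2, (1, 1): 1, (1, 0): 0}  # simplex system S_1
F2 = {(1, 4): 2, (1, 3): 2, (1, 2): 1, (1, 1): 1, (1, 0): 0}  # duplex system S_2


def operational_rate(table, state):
    """Look up f_i(q); states (0, j) are not in the table and map to 0."""
    return table.get(state, 0)


def performance(table, trajectory):
    """Y_i = min of f_i over the states visited during the utilization period."""
    return min(operational_rate(table, q) for q in trajectory)


if __name__ == "__main__":
    path = [(1, 4), (1, 3), (1, 2)]  # two covered processor failures
    print(performance(F1, path), performance(F2, path))
```

For the same trajectory the simplex system stays at level 2 while the duplex system drops to level 1, because state (1,2) is rated differently for the two systems.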

Suppose that the system is initially fault-free, i.e., $\Pr[X_0 = (1,4)] = 1$.
Then, relative to the utilization period $T = [0,t]$, the performability of $S_1$ is
given by

\[ \mathrm{perf}_1(2) = \Pr[Y_1 = 2] = e^{-(4\lambda_2+\lambda_1)t} + 4c_1\left[e^{-(3\lambda_2+\lambda_1)t} - e^{-(4\lambda_2+\lambda_1)t}\right] + 6c_1^2\left[e^{-(2\lambda_2+\lambda_1)t} - 2e^{-(3\lambda_2+\lambda_1)t} + e^{-(4\lambda_2+\lambda_1)t}\right], \]

\[ \mathrm{perf}_1(1) = \Pr[Y_1 = 1] = 4c_1^3\left[e^{-(\lambda_2+\lambda_1)t} - 3e^{-(2\lambda_2+\lambda_1)t} + 3e^{-(3\lambda_2+\lambda_1)t} - e^{-(4\lambda_2+\lambda_1)t}\right], \]

and

\[ \mathrm{perf}_1(0) = \Pr[Y_1 = 0] = 1 - \mathrm{perf}_1(2) - \mathrm{perf}_1(1). \]

Similarly, the performability of $S_2$ is given by

\[ \mathrm{perf}_2(2) = \Pr[Y_2 = 2] = e^{-(4\lambda_2+\lambda_1)t} + 4c_2\left[e^{-(3\lambda_2+\lambda_1)t} - e^{-(4\lambda_2+\lambda_1)t}\right], \]

\[ \mathrm{perf}_2(1) = \Pr[Y_2 = 1] = 4c_2^3 e^{-(\lambda_2+\lambda_1)t} + 6(c_2^2 - 2c_2^3)e^{-(2\lambda_2+\lambda_1)t} + 12(c_2^3 - c_2^2)e^{-(3\lambda_2+\lambda_1)t} + (6c_2^2 - 4c_2^3)e^{-(4\lambda_2+\lambda_1)t}, \]

and

\[ \mathrm{perf}_2(0) = \Pr[Y_2 = 0] = 1 - \mathrm{perf}_2(2) - \mathrm{perf}_2(1). \]
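
The closed forms above can be cross-checked against a direct state-probability computation. The sketch below assumes that, with no repair, state (1,j) is occupied at time t exactly when the memory has survived, j of the 4 processors have survived, and each of the 4-j processor failures was covered; this reading reproduces the expressions above. The parameter values in the usage lines are hypothetical and are not those used for Figures 4.5 and 4.6.

```python
from math import comb, exp


def state_prob(j, c, lam_mem, lam_proc, t, n=4):
    """Pr[X_t = (1, j)] under the no-repair assumption stated in the lead-in."""
    survive = exp(-lam_proc * t)  # one processor still working at time t
    return (exp(-lam_mem * t) * comb(n, j) * c ** (n - j)
            * survive ** j * (1.0 - survive) ** (n - j))


def performability(levels, c, lam_mem, lam_proc, t):
    """Return {2: perf(2), 1: perf(1), 0: perf(0)}.

    levels maps an accomplishment level to the processor counts j whose
    state (1, j) has that operational rate.
    """
    perf = {lvl: sum(state_prob(j, c, lam_mem, lam_proc, t) for j in js)
            for lvl, js in levels.items()}
    perf[0] = 1.0 - perf[2] - perf[1]
    return perf


S1_LEVELS = {2: (4, 3, 2), 1: (1,)}   # simplex: from the operational-rate table
S2_LEVELS = {2: (4, 3), 1: (2, 1)}    # duplex: from the operational-rate table


def perf2_degraded_closed_form(c, lam_mem, lam_proc, t):
    """perf_2(1) written out term by term as in the text."""
    def e(j):
        return exp(-(j * lam_proc + lam_mem) * t)
    return (4 * c**3 * e(1) + 6 * (c**2 - 2 * c**3) * e(2)
            + 12 * (c**3 - c**2) * e(3) + (6 * c**2 - 4 * c**3) * e(4))


if __name__ == "__main__":
    # Hypothetical parameters: rates per hour, t in hours.
    p1 = performability(S1_LEVELS, 0.90, 1e-4, 1e-3, 100.0)
    p2 = performability(S2_LEVELS, 0.99, 1e-4, 1e-3, 100.0)
    print(p1)
    print(p2)
```

Writing the state probabilities in binomial form makes it easy to regroup states when an operational-rate table changes, whereas the expanded exponentials have to be rederived by hand.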

When expressed as functions of the duration of the utilization, the
above equations can be represented, respectively, as in Figures 4.5 and 4.6 for
the indicated parameter values of $S_1$ and $S_2$. When we compare Figure 4.5 with
Figure 4.6, $S_2$ clearly results in better system reliability than $S_1$ in the sense that

\[ \mathrm{perf}_2(0) < \mathrm{perf}_1(0). \]

Moreover, $S_2$ also has a much higher degraded performance even though its
nondegraded performance is slightly lower than that of $S_1$. In general, we may
conclude that, if the system’s ability to recover from faults is low, then the
performance of the system can be made to degrade more gracefully by allowing
the subsystems to operate in duplex mode.



Figure 4.5
Performability of $S_1$: $\mathrm{perf}_1(q) = \Pr[Y_1 = q]$

Figure 4.6
Performability of $S_2$: $\mathrm{perf}_2(q) = \Pr[Y_2 = q]$


CHAPTER 5 


PHASED MODELS 


5.1 Phased-Missions

5.1.1 Introduction 

The performability models considered in the previous chapters 
assume that the environment of the system is invariant in time in the sense 
that the underlying processes are time-homogeneous and the operational 
structures of the system remain the same throughout the utilization period.
Although this assumption is appropriate for certain applications, there are many 
cases where the user’s demands on the computing system can change 
appreciably during different phases of its utilization. This is particularly true for 
real-time control applications in which the computing system is required to 
execute different sets of computational tasks during different phases of a 
control process. 

One approach to dealing with a time-varying environment is to 
decompose the system’s utilization period into consecutive time periods 
(usually referred to as a decomposition of the system’s "mission" into "phases";
see [42]-[45]). Demands on the system are then allowed to vary from phase to
phase; within a given phase, however, they are assumed to be time invariant.
This permits intraphase behaviors to be evaluated in terms of conventional
time-homogeneous models, but raises the interesting question of how the 
intraphase results are combined. This is the essential question addressed in 
investigations of phased-mission reliability evaluation methods (e.g., [42]-[45]) 
where the problem has been constrained as follows. It is assumed, first, that a 
success criterion (formulated, say, by a structure function; see [18] for 
example) can be established for each phase, where the criterion is independent 
of what occurs during other phases. It is required further that successful 
performance of the system be identified with success during all phases, that is, 
the system performs successfully if and only if, for each phase, the 
corresponding success criterion is satisfied throughout that phase. 

Although the above constraints are reasonable for certain types of 
systems, they exclude systems where successful performance involves 
nontrivial interactions among the phases of the mission. In more exact terms, 
it has been shown (see [46], Theorem 6) that "structure-based" formulations of
success are possible if and only if the phases are functionally independent in a 
precisely defined manner. What we wish to do, therefore, is to examine the 
utility of phased-mission evaluation methods in a less restrictive context. 

In addition to removing the above constraints, we extend the 
domain of application to include evaluation of computing system performability. 
Moreover, by representing intraphase models in terms of operational models, 
we are able to obtain useful results even without the typical no-repair 
assumption of the traditional phased-mission reliability methods.

Finally, unlike the models used in phased-mission reliability 
evaluation methods, we permit the intraphase models to differ from phase to 
phase. Thus, the modeling of a particular phase can be tailored not only to the 
computational demands of each phase but also to the relevant properties of the 
total system that influence performance during the phase. 

5.1.2 Formulation 

Intuitively, phased missions are real-time control processes whose
utilization period can be decomposed into phases. During each phase, the
system is required to execute a predetermined set of computational tasks. A
typical example of a phased mission is an "unmanned space mission" during
which the spacecraft’s on-board computer must complete different phases of 
the mission. The analysis of such a system is usually complicated because of 
the time-varying nature of the system’s performance criteria. 

To generalize the notion of a phased-mission in the context of 
performability modeling, we assume that (i) the behavior of each single phase 
can be characterized by a single performance variable regardless of the 
interactions among phases, and (ii) interactions among phases can be 
characterized without reference to the detailed behavior of each phase. More 
precisely, let us suppose that the utilization period $T$ is the continuous interval
$T = [0,h]$. Suppose further that $T$ is divided into a finite number of
consecutive phases (time intervals) $T_1 = [t_0,t_1]$, $T_2 = [t_1,t_2]$, ..., $T_m = [t_{m-1},t_m]$,
where $0 = t_0 < t_1 < \cdots < t_m = h$. During each phase $T_k$, we assume that the
system can be modeled by a performability model $(X^k, \gamma_k)$ where

\[ X^k = \{ X_t^k \mid t \in T_k \} \]

is a continuous-time stochastic process such that, for each $t$ in $T_k$, $X_t^k$ is a
random variable taking values in the phase $k$ state space $Q_k$ ($X_t^k : \Omega \to Q_k$), and

\[ \gamma_k : U_k \to A_k \]

is a function that maps the phase $k$ trajectory space $U_k$ to the phase $k$
accomplishment set $A_k$. $X^k$ is referred to as the intraphase process (of phase $k$)
and $\gamma_k$ is called the phase $k$ capability function. When each phase can be
represented by an intraphase performability model, a performability model
$(X, \gamma)$ of $S$ is referred to as a phased model if it can be constructed by the
following steps:

(i) \[ X = \bigcup_{k=1}^{m} X^k = \bigcup_{k=1}^{m} \{ X_t^k \mid t \in T_k \} \tag{5.1} \]

(ii) there exists a function (referred to as an organizing
structure)

\[ \Psi : A_1 \times A_2 \times \cdots \times A_m \to A \tag{5.2} \]

such that, for all $u \in U$,

\[ \gamma(u) = \Psi(\gamma_1(u_1), \ldots, \gamma_m(u_m)) \]

where $u_k$ is the restriction of $u$ to $T_k$ ($k = 1, 2, \ldots, m$).

On examining $X$ we see that it is similar to a base model except
that, for each time instant $t_k$ ($1 \le k < m$), the state of the system is represented
by two random variables $X_{t_k}^{k}$ and $X_{t_k}^{k+1}$ whose values, respectively, are the final
state of the $k$th phase and the initial state of the $(k+1)$th phase. Since we permit
the state sets of the intraphase models to differ from phase to phase,
$X_{t_k}^{k}$ and $X_{t_k}^{k+1}$ can also be different. However, if we consider an augmented
utilization period

\[ T' = T \cup \{ t_k' \mid k = 1, 2, \ldots, m-1 \} \]

(where $t_k'$ can be interpreted as the initial time of phase $k+1$), then $X$ can be
expressed as

\[ X = \{ X_t \mid t \in T' \} \]

where

\[ X_t = \begin{cases} X_0^{1} & \text{if } t = 0, \\ X_t^{k} & \text{if } t \in (t_{k-1}, t_k], \\ X_{t_k}^{k+1} & \text{if } t = t_k'. \end{cases} \tag{5.3} \]


If, further, we regard the state space of $X$ as the union

\[ Q = \bigcup_{k=1}^{m} Q_k , \]

then $X$ is a base model in the sense defined in Chapter 2. When $X$ is so
constructed from intraphase processes, we will refer to it as a phased base
model.

Generally, there are two types of dependencies among phases that
can affect the performability evaluation of phased models. The first type,
encountered when computing intraphase performabilities, is caused by
statistical dependencies among phases. For example, if we assume that the $k$th
intraphase process $X^k$ is a Markov process with a given transition probability
function, then the performability of the system during phase $k$ can be
computed once the initial distribution of $X^k$ is known. However, in general,
the initial distribution of $X^k$ is determined by that of the first phase together
with the behavior of the previous phases in realizing a specific level of
performance of the total system.

The second type of dependence occurs when combining intraphase
performabilities to determine system performability. This type of dependence 
is determined by the algebraic relationship among phases, i.e., those 
relationships which do not involve probability concepts. In other words, the 
relationship is analogous to structure functions that are concerned with the 
structural representation of multicomponent systems (see [47], for example). 
Clearly, a complete analysis of phased models will require a detailed knowledge 
of both types of dependencies and their effects on the performability of the 
system. 

In the following section, we first study the above algebraic 
relationship among phases via an extended definition of structure functions. In 
Section 5.3, we then consider the probabilistic aspects of the dependencies 
among phases. In both cases, we only assume that the behavior of the system
during each phase can be summarized by a performance variable $Y_k$ defined by
the phase $k$ performability model $(X^k, \gamma_k)$. Finally, in Section 5.4,
computational methods and formulas are derived for the case when $Y_k$ can be defined as
the minimum value assumed by a functional of $X^k$.

5.2 Structural Properties of Phased Models 

In system theory, the structure of a system is generally taken to be 
the interactions among subsystems to perform certain specific tasks. The 
interaction may involve the physical interconnection of the subsystems or, 
more generally, "functional dependence" among subsystems [46]. Thus, if we
regard the phases of a phased model as subsystems, then the structure of a
phased model can be effectively regarded as the interaction among phases in
the realization of various degrees of system performance.

Since the relationship among phases can be represented as an
organizing structure, and since the accomplishment sets of the phases are
totally ordered sets (see (2.2)), there is a natural connection between the
structure of a phased model and the mappings of partially ordered sets. More
precisely, for each phase $k$, let us denote the phase $k$ performance variable as

\[ Y_k : \Omega \to A_k \]

defined by the phase $k$ performability model $(X^k, \gamma_k)$ according to (2.12).
Then, by (5.2), the performance variable of the total system $S$ can be
represented as a random variable

\[ Y : \Omega \to A \]

where, for each $\omega \in \Omega$,

\[ Y(\omega) = \Psi(Y_1(\omega), \ldots, Y_m(\omega)) . \]


Thus, the structural relationship between $Y_1, Y_2, \ldots, Y_m$ can be characterized in
terms of the properties of the mapping $\Psi : A_1 \times \cdots \times A_m \to A$. The product of
sets $A_1 \times \cdots \times A_m$ is a partially ordered set because, by extending the ordering
relation of the individual phases, an ordering relation for $A_1 \times \cdots \times A_m$ can be
defined as, for all $(a_1, a_2, \ldots, a_m)$ and $(b_1, b_2, \ldots, b_m)$ in $A_1 \times \cdots \times A_m$,

\[ (a_1, a_2, \ldots, a_m) \le (b_1, b_2, \ldots, b_m) \ \text{iff}\ a_k \le b_k \text{ for all } k = 1, 2, \ldots, m . \tag{5.4} \]

Our interest will be restricted to the case when $\Psi : A_1 \times \cdots \times A_m \to A$
is order-preserving in the sense that, for all $(a_1, a_2, \ldots, a_m)$ and $(b_1, b_2, \ldots, b_m)$ in
$A_1 \times \cdots \times A_m$,

\[ (a_1, a_2, \ldots, a_m) \le (b_1, b_2, \ldots, b_m) \ \text{implies}\ \Psi(a_1, a_2, \ldots, a_m) \le \Psi(b_1, b_2, \ldots, b_m) . \]

Order-preserving mappings as defined above can be used to characterize
systems whose performance does not deteriorate due to performance
improvements of the subsystems. Thus, the notion of an order-preserving
mapping is a proper generalization of the notion of a coherent structure
function [18] because the coordinates of $A_1 \times \cdots \times A_m$ are not restricted to
binary valued sets.


Our investigation efforts will be focused on the properties of $\Psi$
which permit us to simplify the evaluation of system performability. In
particular, given a subset $B$ of $A$, we wish to consider methods for representing
the set $\Psi^{-1}(B)$ without enumerating its elements. The representation methods
are important because, as shown in the following discussions, such
representations are amenable to iterative methods of evaluation.

We note first that the effect of $\Psi$ on $A_1 \times \cdots \times A_m$ is that it
imposes an order structure on the equivalence kernel of the order-preserving
mapping. First, let us define, for all $a \in A$,

\[ C(a) = \{ q \mid q \in A_1 \times \cdots \times A_m \text{ and } \Psi(q) > a \} \tag{5.5} \]

and

\[ D(a) = \{ q \mid q \in A_1 \times \cdots \times A_m \text{ and } \Psi(q) \ge a \} . \tag{5.6} \]

In words, $D(a)$ is the set of elements in $A_1 \times \cdots \times A_m$ that result in at least
level $a$ accomplishment. Clearly, when $\Psi$ is an order-preserving mapping, for
all $a, b \in A$,

\[ a \le b \ \text{implies}\ C(a) \supseteq C(b) \text{ and } D(a) \supseteq D(b) . \tag{5.7} \]


Accordingly, if we let

\[ \mathcal{C} = \{ C(a) \mid a \in A \} \tag{5.8} \]

and

\[ \mathcal{D} = \{ D(a) \mid a \in A \} , \tag{5.9} \]

then, because $A$ is a totally ordered set, it follows that $\mathcal{C}$ and $\mathcal{D}$ are totally
ordered sets with respect to set inclusion. Moreover, if the elements in $A$ are
expressed as a sequence

\[ \cdots < a_1 < a_2 < \cdots < a_i < \cdots \]

where, for all $i$, $a_{i+1}$ covers $a_i$ in the sense that $a_{i+1} > a_i$ and for no
$a \in A$, $a_{i+1} > a > a_i$, then the corresponding elements in $\mathcal{C}$ and $\mathcal{D}$ can be expressed
in a like manner, i.e.,

\[ \cdots \supseteq C(a_1) \supseteq C(a_2) \supseteq \cdots \supseteq C(a_i) \supseteq \cdots \]

and

\[ \cdots \supseteq D(a_1) \supseteq D(a_2) \supseteq \cdots \supseteq D(a_i) \supseteq \cdots \]

Hence, if we denote the difference between two sets $X$ and $Y$ by $X - Y$, then it
can easily be shown that

Theorem 5.1:

Let $A$ be a totally ordered and countable set and let
$\Psi : A_1 \times \cdots \times A_m \to A$ be an order-preserving mapping. Then, for all $a, b \in A$, $a$
covers $b$ implies, for all $q, r \in A_1 \times \cdots \times A_m$,

(1) $\Psi(q) = \Psi(r) = a$ if and only if
both $q$ and $r$ belong to $C(b) - C(a)$,

(2) $\Psi(q) = \Psi(r) = b$ if and only if
both $q$ and $r$ belong to $D(b) - D(a)$.

The above results imply that, when evaluating system performability
based on a phased model, the sets $C(a)$ and $D(a)$, where $a \in A$, can be used as
building blocks for describing events that characterize system performance. In
particular, given an order-preserving mapping $\Psi : A_1 \times \cdots \times A_m \to A$ and a closed
interval $B = [a,b] \subseteq A$, we show in the following theorems that $\Psi^{-1}(B)$ can be
specified by $D(a)$ and $C(b)$. The set $\Psi^{-1}(B)$ has practical significance in
performability modeling because its probability quantifies the ability of $S$ to
perform within the specified limits $a$ and $b$.

First, we note that Cartesian subsets of $A_1 \times \cdots \times A_m$ are amenable
to iterative methods of evaluation. A subset $V \subseteq A_1 \times \cdots \times A_m$ is called a
Cartesian subset if $V = \xi_1(V) \times \cdots \times \xi_m(V)$ where $\xi_k(V)$ is the projection of $V$
onto its $k$th coordinate. To illustrate the use of Cartesian sets in the evaluation
of performabilities, let us consider the evaluation of a specific Cartesian subset
$V$. For each coordinate $k$, let

\[ B_k = \{ q \in A_1 \times \cdots \times A_m \mid \xi_k(q) \in \xi_k(V) \} \]

be the set of elements in $A_1 \times \cdots \times A_m$ that assume values in $\xi_k(V)$ at the $k$th
coordinate. Then, the probability of $B_k$ can be expressed as a one-dimensional
distribution of the phase $k$ performance variable $Y_k$, i.e.,

\[ \Pr[B_k] = \Pr[(Y_1, Y_2, \ldots, Y_m) \in B_k] = \Pr[Y_k \in \xi_k(V)] . \]

Accordingly, when $V$ is a Cartesian set, $V$ clearly can be represented as the
intersection of those elementary sets $B_k$, i.e.,

\[ V = \xi_1(V) \times \cdots \times \xi_m(V) = \bigcap_{k=1}^{m} B_k . \]


By iteratively applying the definition of conditional probability, we then have

\[ \Pr[V] = \Pr\Big[B_m \,\Big|\, \bigcap_{k=1}^{m-1} B_k\Big] \cdot \Pr\Big[B_{m-1} \,\Big|\, \bigcap_{k=1}^{m-2} B_k\Big] \cdots \Pr[B_2 \mid B_1] \cdot \Pr[B_1] \]

\[ = \Pr[Y_m \in \xi_m(V) \mid Y_{m-1} \in \xi_{m-1}(V), \ldots, Y_1 \in \xi_1(V)] \cdots \Pr[Y_2 \in \xi_2(V) \mid Y_1 \in \xi_1(V)] \cdot \Pr[Y_1 \in \xi_1(V)] . \tag{5.10} \]

Since each term in the product involves only elementary sets $B_k$, we show in
the following section that $\Pr[V]$ can be determined iteratively using matrix
multiplications.
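
The factorization in (5.10) can be illustrated on a small joint distribution. The sketch below is only a brute-force check of the chain rule, not the matrix method of the following section; the two-phase joint distribution and all names are hypothetical.

```python
from fractions import Fraction as F


def prob(joint, event):
    """Sum the joint probabilities of all outcomes y that satisfy event(y)."""
    return sum(p for y, p in joint.items() if event(y))


def pr_cartesian_direct(joint, projections):
    """Pr[V] for V = xi_1(V) x ... x xi_m(V), summed directly."""
    return prob(joint, lambda y: all(y[i] in s for i, s in enumerate(projections)))


def pr_cartesian_chain(joint, projections):
    """Pr[V] as the product of conditional probabilities in (5.10)."""
    total, prefix = 1, 1  # prefix holds Pr[B_1 and ... and B_(k-1)]
    for k in range(1, len(projections) + 1):
        joint_k = prob(joint, lambda y: all(y[i] in projections[i] for i in range(k)))
        if prefix == 0:
            return 0  # conditioning event has probability zero, so Pr[V] = 0
        total *= joint_k / prefix  # Pr[B_k | B_1, ..., B_(k-1)]
        prefix = joint_k
    return total


# Hypothetical joint distribution of (Y_1, Y_2) with levels in {0, 1, 2}.
JOINT = {(2, 2): F(6, 10), (2, 1): F(1, 10), (1, 1): F(2, 10),
         (1, 0): F(1, 20), (0, 0): F(1, 20)}

if __name__ == "__main__":
    v = [{1, 2}, {1, 2}]  # both phases at least degraded
    print(pr_cartesian_direct(JOINT, v), pr_cartesian_chain(JOINT, v))
```

Both routes give 9/10 for this example; the chain form is the one that lends itself to phase-by-phase evaluation because each factor conditions only on the earlier phases.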

An important class of Cartesian subsets of $A_1 \times \cdots \times A_m$ is the sets
of "intervals." For all $q, r \in A_1 \times \cdots \times A_m$ where $q = (a_1, a_2, \ldots, a_m)$, $r = (b_1, b_2, \ldots, b_m)$
and $q \le r$, a closed interval $[q,r]$ is defined to be

\[ [q,r] = \{ q' \in A_1 \times \cdots \times A_m \mid q \le q' \le r \} . \]

It follows that

\[ [q,r] = \xi_1([q,r]) \times \cdots \times \xi_m([q,r]) = [a_1,b_1] \times \cdots \times [a_m,b_m] , \]

hence $[q,r]$ is Cartesian. The open interval $(q,r)$ and the half-open intervals $(q,r]$ and
$[q,r)$ can be defined in a like manner. Moreover, if we assume that
$A_1 \times \cdots \times A_m$ has greatest and least elements $I$ and $0$, satisfying

\[ q \ge 0 \text{ and } I \ge q \]

for all $q$ in $A_1 \times \cdots \times A_m$, then $A_1 \times \cdots \times A_m$ can be expressed as a closed
interval $[0,I]$.

Given an interval $B$ of $A$, an important problem then is to find the
"representation" of $\Psi^{-1}(B)$ in terms of intervals of $A_1 \times \cdots \times A_m$. The
problem is interesting not only because the representation permits us to
evaluate $\Pr[\Psi^{-1}(B)]$ using the iterative algorithm described above but also
because the representation reduces the number of elements needed to describe
the set $\Psi^{-1}(B)$.

To express $\Psi^{-1}(B)$ as intervals, we note first that since $\Psi$ is order-preserving,
for all $q, r \in A_1 \times \cdots \times A_m$, we have

\[ q \le x \le r \ \text{implies}\ \Psi(q) \le \Psi(x) \le \Psi(r) \tag{5.11} \]

(for all $x \in A_1 \times \cdots \times A_m$). Thus, if $B$ is an interval of $A$ and $q, r \in \Psi^{-1}(B)$, it
follows that

\[ [q,r] \subseteq \Psi^{-1}(B) . \]

Moreover, for each $a \in A$, let $C(a)$ and $D(a)$ be the sets as defined in (5.5) and
(5.6), and denote the set complement of $C(a)$ by $\sim C(a)$. If we define

\[ M(a) = \{ q \in A_1 \times \cdots \times A_m \mid q \text{ is a maximal element of } \sim C(a) \} , \tag{5.12} \]

and

\[ m(a) = \{ q \in A_1 \times \cdots \times A_m \mid q \text{ is a minimal element of } D(a) \} , \tag{5.13} \]

then, applying (5.11), it follows that

Lemma:

If $\Psi : A_1 \times \cdots \times A_m \to A$ is order-preserving, then

\[ \Psi^{-1}([a,b]) \supseteq \bigcup_{q \in m(a),\ r \in M(b)} [q,r] . \tag{5.14} \]


In general, the relation $\supseteq$ in (5.14) cannot be replaced by an
equality. However, we found that the equality holds when $A_1 \times \cdots \times A_m$
satisfies the chain conditions [48]. A partially ordered set $L$ is said to satisfy the
ascending chain condition when every nonempty subset of $L$ has a maximal
element. Similarly, $L$ is said to satisfy the descending chain condition when every
nonempty subset of $L$ has a minimal element. $L$ is said to satisfy the chain
conditions when it satisfies both chain conditions. When the chain conditions
are imposed on the above lemma, we then have the following result:

Theorem 5.2:

If $\Psi : A_1 \times \cdots \times A_m \to A$ is order-preserving and $A_1 \times \cdots \times A_m$
satisfies the chain conditions, then for all $a, b \in A$,

\[ \Psi^{-1}([a,b]) = \bigcup_{q \in m(a),\ r \in M(b)} [q,r] . \tag{5.15} \]


Proof:

We only need to show that

\[ \Psi^{-1}([a,b]) \subseteq \bigcup_{q \in m(a),\ r \in M(b)} [q,r] , \]

i.e., for all $x \in A_1 \times \cdots \times A_m$, we have to show that $a \le \Psi(x) \le b$ implies $q \le x \le r$,
for some $q \in m(a)$ and $r \in M(b)$. First, for each fixed $x \in A_1 \times \cdots \times A_m$, let

\[ K_1 = \{ y \in A_1 \times \cdots \times A_m \mid \Psi(y) \le b \text{ and } x \le y \} \]

and

\[ K_2 = \{ y \in A_1 \times \cdots \times A_m \mid a \le \Psi(y) \text{ and } y \le x \} . \]

Then $K_1$ and $K_2$ are clearly nonempty since $x \in K_1$ and $x \in K_2$. Moreover, since
$A_1 \times \cdots \times A_m$ satisfies the chain conditions, $K_1$ must satisfy the ascending
chain condition and $K_2$ must satisfy the descending chain condition. Hence $K_1$
contains a maximal element, say $r$, and $K_2$ contains a minimal element, say $q$,
that is, $q \le x \le r$ for some $q \in K_2$ and $r \in K_1$. Finally, we note that $r$ being a maximal
element of $K_1$ implies $r \in M(b)$ and $q$ being a minimal element of $K_2$ implies
$q \in m(a)$, i.e., $x \in [q,r]$ for some $q \in m(a)$ and $r \in M(b)$.

Note that, for each $a \in A$, the set $M(a)$ is an unordered set in the
sense that, for all $q_1$ and $q_2$ in $M(a)$, neither $q_1 \le q_2$ nor $q_1 \ge q_2$ unless $q_1 = q_2$.
Hence, the number of elements in $M(a)$ is bounded by the width of
$A_1 \times \cdots \times A_m$, defined to be a natural number $n$ if and only if there is an
unordered subset $K$ of $A_1 \times \cdots \times A_m$ of $n$ elements such that all unordered
subsets of $A_1 \times \cdots \times A_m$ have no more than $n$ elements [48]. Similarly, for all
$a \in A$, the cardinality of $m(a)$ is no larger than the width of $A_1 \times \cdots \times A_m$.

Since a finite partially ordered set always satisfies the chain conditions
and the width of a finite partially ordered set is finite, we also obtain the 
following result: 

Corollary:

If $\Psi : A_1 \times \cdots \times A_m \to A$ is order-preserving and $A_1 \times \cdots \times A_m$ is
finite, then for all intervals $B$ in $A$, $\Psi^{-1}(B)$ can be expressed as the union of a
finite number of intervals in $A_1 \times \cdots \times A_m$.
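
For a finite product, the representation (5.15) can be checked by brute force. In the sketch below, the organizing structure `psi` is a hypothetical two-phase example chosen only because it is order-preserving (level 2 when both phases are at level 2, level 0 when both are at level 0, level 1 otherwise); the helper names are illustrative.

```python
from itertools import product

LEVELS = (0, 1, 2)
DOMAIN = list(product(LEVELS, repeat=2))  # A_1 x A_2


def psi(q):
    """Hypothetical order-preserving organizing structure (an assumption)."""
    if q == (2, 2):
        return 2
    return 1 if max(q) >= 1 else 0


def leq(q, r):
    """Componentwise order (5.4)."""
    return all(x <= y for x, y in zip(q, r))


def minimal(s):
    return [q for q in s if not any(leq(r, q) and r != q for r in s)]


def maximal(s):
    return [q for q in s if not any(leq(q, r) and r != q for r in s)]


def interval(q, r):
    """Closed interval [q, r] in the product order."""
    return {x for x in DOMAIN if leq(q, x) and leq(x, r)}


def preimage_as_intervals(a, b):
    """Pairs (q, r) with q in m(a), r in M(b) and q <= r, as in (5.15)."""
    m_a = minimal([q for q in DOMAIN if psi(q) >= a])  # minimal elements of D(a)
    M_b = maximal([q for q in DOMAIN if psi(q) <= b])  # maximal elements of ~C(b)
    return [(q, r) for q in m_a for r in M_b if leq(q, r)]


if __name__ == "__main__":
    for q, r in preimage_as_intervals(0, 1):
        print(q, r, sorted(interval(q, r)))
```

For this `psi`, the inverse image of [0,1] is covered by the two intervals [(0,0),(1,2)] and [(0,0),(2,1)], so eight elements are described by two pairs of endpoints.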

In addition to the conditions of Theorem 5.2, Equation (5.15) also
holds when $\Psi : A_1 \times \cdots \times A_m \to A$ is a lattice homomorphism. A partially
ordered set $L$ is called a lattice when any two of its elements $x$ and $y$ have a
least upper bound, denoted by $x \vee y$, and a greatest lower bound, denoted by
$x \wedge y$ [48]. Clearly, every totally ordered set $L$ is a lattice because, for any
$q, r \in L$,

\[ q \vee r = \begin{cases} q & \text{if } q \ge r, \\ r & \text{otherwise.} \end{cases} \]

Moreover, since $A_1, A_2, \ldots, A_m$ are totally ordered sets, the least upper bound
and the greatest lower bound of any two elements $(a_1, a_2, \ldots, a_m)$ and
$(b_1, b_2, \ldots, b_m)$ in $A_1 \times \cdots \times A_m$ can be defined as

\[ (a_1, a_2, \ldots, a_m) \vee (b_1, b_2, \ldots, b_m) = (a_1 \vee b_1, a_2 \vee b_2, \ldots, a_m \vee b_m) \tag{5.16} \]

and

\[ (a_1, a_2, \ldots, a_m) \wedge (b_1, b_2, \ldots, b_m) = (a_1 \wedge b_1, a_2 \wedge b_2, \ldots, a_m \wedge b_m) . \tag{5.17} \]



Now, since $A_1 \times \cdots \times A_m$ is a partially ordered set, it follows immediately from
(5.16) and (5.17) that $A_1 \times \cdots \times A_m$ is also a lattice.

When the mapping $\Psi : A_1 \times \cdots \times A_m \to A$ is a lattice homomorphism in
the sense that, for all $q, r \in A_1 \times \cdots \times A_m$,

\[ \Psi(q \vee r) = \Psi(q) \vee \Psi(r) \]

and

\[ \Psi(q \wedge r) = \Psi(q) \wedge \Psi(r) , \]

we are able to show that

Theorem 5.3:

If $\Psi : A_1 \times \cdots \times A_m \to A$ is a lattice homomorphism and
$A_1 \times \cdots \times A_m$ satisfies the chain conditions, then, for any interval $B$ in $A$,
$\Psi^{-1}(B)$ can be expressed as an interval of $A_1 \times \cdots \times A_m$ in one and only one
way.

Proof:

We note first that a lattice homomorphism is order-preserving
because, when $q \ge r$,

\[ \Psi(q \wedge r) = \Psi(q) \wedge \Psi(r) = \Psi(r) \]

and

\[ \Psi(q \vee r) = \Psi(q) \vee \Psi(r) = \Psi(q) . \]

Moreover, given an interval $B$ in $A$, $\Psi^{-1}(B)$ contains at most one maximal
element. To prove this statement, let us suppose that $\Psi^{-1}(B)$ contains two
maximal elements $q$ and $r$. Since $A$ is a totally ordered set, we have either
$\Psi(q) \ge \Psi(r)$ or $\Psi(q) \le \Psi(r)$. If we assume $\Psi(q) \ge \Psi(r)$, then

\[ \Psi(q \vee r) = \Psi(q) \vee \Psi(r) = \Psi(q) . \]

But this implies $q$ cannot be a maximal element of $\Psi^{-1}(B)$ unless $q \vee r = q$ or,
equivalently, $q \ge r$. However, since $r$ is also a maximal element of $\Psi^{-1}(B)$,
$q \ge r$ implies $q = r$. In other words, $\Psi^{-1}(B)$ contains at most one maximal
element. By a similar argument, we can also show that $\Psi^{-1}(B)$ contains at most
one minimal element.

If we assume that $\Psi^{-1}(B)$ is nonempty then, since $A_1 \times \cdots \times A_m$
satisfies the chain conditions, $\Psi^{-1}(B)$ contains a unique maximal element, say
$q_2$, and a unique minimal element, say $q_1$. Now since $\Psi$ is order-preserving,
we have

\[ \Psi^{-1}(B) \supseteq [q_1, q_2] . \]

Thus, it remains to be shown that

\[ \Psi^{-1}(B) \subseteq [q_1, q_2] , \]

i.e., $x \in \Psi^{-1}(B)$ implies $q_1 \le x \le q_2$. This follows immediately since $q_2$ is also a
maximal element of the set $\{ y \in \Psi^{-1}(B) \mid y \ge x \}$ and $q_1$ is a minimal element of
the set $\{ y \in \Psi^{-1}(B) \mid y \le x \}$.

Theorems 5.2 and 5.3, in our opinion, establish a feasible
method for the evaluation of system performability based on the notion of a
phased model. To illustrate the application of the method, let us consider the
following hypothetical two-phase model. During each phase, it is assumed that
the phase $k$ performance variable is given by

\[ Y_k: \Omega \to A_k \quad (k = 1 \text{ or } 2) \]

where $A_k = \{0,1,2\}$ and

\[ Y_k = \begin{cases}
2 & \text{if the system operates in the nondegraded mode throughout phase } k, \\
1 & \text{if the system enters degraded mode during phase } k \text{ but remains operational throughout the phase,} \\
0 & \text{otherwise.}
\end{cases} \]


It is further assumed that the performance variable of the total system $S$ is
given by

\[ Y: \Omega \to A \]

where $A = \{0,1,2\}$ and

\[ Y = \begin{cases}
2 & \text{if the system operates in the nondegraded mode throughout all phases,} \\
1 & \text{if the system remains operational for at least one phase,} \\
0 & \text{otherwise.}
\end{cases} \]


Then, the organizing structure can be expressed as an order-preserving
mapping

\[ \Psi: \{0,1,2\} \times \{0,1,2\} \to \{0,1,2\} \]

as illustrated in Figure 5.1. Note that in the figure the domain and the
codomain of the mapping are both represented by diagrams where two nodes $q$
and $r$ are connected by a downward line when $q$ covers $r$.

Suppose that the user is interested in evaluating the probability of
the event $Y = 1$ or $0$, i.e., the probability of encountering degraded
performance or system failure. Applying Theorem 5.2, the event $\Psi^{-1}([0,1])$
can be expressed as

\[ \Psi^{-1}([0,1]) = [(0,0),(2,1)] \cup [(0,0),(1,2)] \]

where $[(0,0),(2,1)]$ and $[(0,0),(1,2)]$ are intervals of the partially ordered set
$\{0,1,2\} \times \{0,1,2\}$. Note that the intersection of $[(0,0),(2,1)]$ and $[(0,0),(1,2)]$ is
also an interval, $[(0,0),(1,1)]$. Accordingly, the probability of $\Psi^{-1}([0,1])$ can be
expressed as

\[ \begin{aligned}
\Pr[\Psi^{-1}([0,1])] &= \Pr[\,[(0,0),(2,1)] \cup [(0,0),(1,2)]\,] \\
&= \Pr[\,[(0,0),(2,1)]\,] + \Pr[\,[(0,0),(1,2)]\,] - \Pr[\,[(0,0),(1,1)]\,] .
\end{aligned} \]

Since an interval of $\{0,1,2\} \times \{0,1,2\}$ is a Cartesian subset, each of the last three
terms of the above equality can be evaluated iteratively using the solution
method described in Section 5.3 for Cartesian subsets.
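
The interval bookkeeping in this example can be verified by enumeration. The sketch below assumes an arbitrary, purely illustrative probability weight on the nine points of {0,1,2} x {0,1,2}; only the set identities come from the text.

```python
from fractions import Fraction
from itertools import product


def interval(lo, hi):
    """All points x of the product order with lo <= x <= hi componentwise."""
    return set(product(*(range(l, h + 1) for l, h in zip(lo, hi))))


domain = interval((0, 0), (2, 2))
i1 = interval((0, 0), (2, 1))
i2 = interval((0, 0), (1, 2))

# The union misses only (2,2), and the overlap is the interval [(0,0),(1,1)].
union_is_complement = (i1 | i2) == domain - {(2, 2)}
overlap_is_interval = (i1 & i2) == interval((0, 0), (1, 1))

# Hypothetical weights, normalized to a probability distribution.
weights = {pt: 1 + pt[0] + 2 * pt[1] for pt in domain}
total = sum(weights.values())


def pr(points):
    return Fraction(sum(weights[pt] for pt in points), total)


lhs = pr(i1 | i2)
rhs = pr(i1) + pr(i2) - pr(i1 & i2)
print(union_is_complement, overlap_is_interval, lhs == rhs)
```

Exact fractions are used so that the two sides of the identity can be compared without rounding concerns.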
Figure 5.1

An Order-Preserving Mapping $\Psi: \{0,1,2\}^2 \to \{0,1,2\}$
In general, given an order-preserving mapping $\Psi: A_1 \times \cdots \times A_m \to A$
and an interval $B$ of $A$, let us suppose $\Psi^{-1}(B)$ can be expressed as the set
union of a finite number of intervals $I_1, I_2, \ldots, I_N$, i.e.,

\[ \Psi^{-1}(B) = I_1 \cup I_2 \cup \cdots \cup I_N . \]

To generalize the above evaluation method, let us define

\[ \begin{aligned}
S_1 &= \sum_{i} \Pr[I_i] \\
S_2 &= \sum_{i,j} \Pr[I_i \cap I_j] \\
S_3 &= \sum_{i,j,k} \Pr[I_i \cap I_j \cap I_k] \\
&\ \ \vdots \\
S_N &= \Pr[I_1 \cap I_2 \cap \cdots \cap I_N]
\end{aligned} \]

where $1 \leq i < j < k < \cdots \leq N$ so that in the sums each combination appears once
and only once; hence $S_r$ has $\binom{N}{r}$ terms. Then, since the intersection of
intervals is an interval, each of the sums can be evaluated by repeatedly
applying the computational algorithms described in the following section.
Hence, by the method of inclusion and exclusion (see [51], p. 89), the
probability of $\Psi^{-1}(B)$ can be computed by the well-known formula

\[ \Pr[\Psi^{-1}(B)] = S_1 - S_2 + S_3 - S_4 + \cdots \pm S_N . \tag{5.18} \]
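
Formula (5.18) can be sketched in code as follows. An interval is represented here by its pair of endpoints, and the intersection of intervals is again an interval (possibly empty), obtained from componentwise maxima and minima of the endpoints. The function `pr_interval`, the uniform distribution, and the three sample intervals are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from itertools import combinations, product


def intersect(intervals):
    """Intersection of intervals (lo, hi): max of lows, min of highs."""
    lo = tuple(max(c) for c in zip(*(iv[0] for iv in intervals)))
    hi = tuple(min(c) for c in zip(*(iv[1] for iv in intervals)))
    return lo, hi


def points(iv):
    """Points of an interval; empty when some low exceeds its high."""
    lo, hi = iv
    return set(product(*(range(l, h + 1) for l, h in zip(lo, hi))))


def pr_union(intervals, pr_interval):
    """Inclusion-exclusion: S_1 - S_2 + S_3 - ... as in (5.18)."""
    total = 0
    for r in range(1, len(intervals) + 1):
        s_r = sum(pr_interval(intersect(c)) for c in combinations(intervals, r))
        total += s_r if r % 2 == 1 else -s_r
    return total


# Hypothetical setting: uniform distribution on {0,1,2} x {0,1,2}.
def pr_interval(iv):
    return Fraction(len(points(iv)), 9)


sample = [((0, 0), (2, 1)), ((0, 0), (1, 2)), ((1, 1), (2, 2))]
by_formula = pr_union(sample, pr_interval)
by_enumeration = Fraction(len(set().union(*(points(iv) for iv in sample))), 9)
print(by_formula, by_enumeration)
```

In an actual evaluation `pr_interval` would be replaced by the Cartesian-set computation of Section 5.3; the number of terms grows as $2^N - 1$, so the approach is attractive mainly when the number of intervals is small.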


5.3 Probability Computation of Cartesian Trajectory Sets 

Given a phased model $(X, \gamma)$ satisfying assumptions (5.1) and (5.2),
the performability model can be simplified as follows. The simplified base
model is taken to be the imbedded discrete-time process

\[ X_S = \{ Y_k \mid k = 1, 2, \ldots, m \} \]

where, for each $k = 1, 2, \ldots, m$, $Y_k$ is the phase $k$ performance variable. The
trajectory space of $X_S$ can be effectively regarded as the product space

\[ U = A_1 \times A_2 \times \cdots \times A_m \]

where $A_k$ is the accomplishment set of phase $k$. The corresponding
simplification of $\gamma$ is the organizing structure

\[ \Psi: A_1 \times \cdots \times A_m \to A \]

as defined in (5.2). Then, it follows that, for all $a$ in $A$, the probability that $S$
performs at level $a$ is

\[ \mathrm{perf}(a) = \Pr[\gamma^{-1}(a)] = \Pr[\Psi^{-1}(a)] \]

and hence the performability model $(X_S, \Psi)$ can be used to evaluate the
performability of $S$. We will thus refer to $(X_S, \Psi)$ as being equivalent to the
model $(X, \gamma)$.

Generally, given an equivalent performability model $(X_S, \Psi)$, the
evaluation of $\Pr[\Psi^{-1}(a)]$ requires a detailed knowledge of how intraphase
processes cooperate to accomplish level $a$, i.e., a thorough understanding of
their functional dependencies (see [46]). The difficulties are further aggravated
by statistical dependencies between phases. However, we show in the following
discussions that when a trajectory set $V \subseteq U$ is Cartesian in the sense that, for
every phase $k$, there exists $R_k \subseteq A_k$ such that $V = R_1 \times R_2 \times \cdots \times R_m$, then $\Pr[V]$
can be determined iteratively using matrix multiplications. Moreover, given
this ability to compute the probabilities of Cartesian sets, the probabilities of
more general sets can be determined by decomposing them into Cartesian
components (see (5.18)). Hence, the problem reduces to that of computing
the probabilities of Cartesian trajectory sets.

For each phase $k$, let $n_k$ be the number of states in $Q_k$. Then, for
a Cartesian trajectory set $V = R_1 \times R_2 \times \cdots \times R_m$, the conditional
intraphase transition matrix of the $k$th phase is the $n_k \times n_k$ matrix $P_{V,k}$ where,
for all $i, j \in Q_k$,

\[ P_{V,k}(i,j) = \Pr[Y_k \in R_k,\ X^k_{t_k} = j \mid X^k_{t_{k-1}} = i,\ Y_{k-1} \in R_{k-1}, \ldots, Y_1 \in R_1] \tag{5.19} \]

where $X^k_{t_{k-1}}$ and $X^k_{t_k}$ respectively are the initial and the final states of the phase $k$
intraphase process (see (5.1)). In other words, $P_{V,k}(i,j)$ is the probability of
having performance levels in $R_k$ during phase $k$ while the intraphase process
initiates in state $i$ and ends up in state $j$, conditioned on the first $k-1$ components
of $V$. Similarly, for all but the first phase, the conditional interphase transition
matrix is the $n_{k-1} \times n_k$ matrix $H_{V,k}$ where, for all $i \in Q_{k-1}$ and all $j \in Q_k$,

\[ H_{V,k}(i,j) = \Pr[X^k_{t_{k-1}} = j \mid X^{k-1}_{t_{k-1}} = i,\ Y_{k-1} \in R_{k-1}, \ldots, Y_1 \in R_1] . \tag{5.20} \]

In other words, $H_{V,k}(i,j)$ is the probability that the $k$th phase initiates in state $j$
given that the final state of the $(k-1)$th phase is $i$, conditioned on the first $k-1$
components of $V$. Finally, for consistency, we let $H_{V,1}$ be the $n_1 \times n_1$ identity
matrix. In terms of the above matrices, we are able to establish the following
matrix formula for computing the probability of a Cartesian trajectory set $V$.
Given $X$, let $p$ denote the initial state distribution, i.e., $p = [p_1 \ p_2 \ \cdots \ p_{n_1}]$
where $p_i = \Pr[X^1_0 = i]$, and let $F_k$ denote the $n_k \times 1$ column matrix with "1" in
each entry. Then, by induction on $k$, it can be established that

Theorem 5.4:

If $V = R_1 \times \cdots \times R_k \times Q_{k+1} \times \cdots \times Q_m$ then

\[ \Pr[V] = p \cdot \Big[ \prod_{\ell=1}^{k} H_{V,\ell} \cdot P_{V,\ell} \Big] \cdot F_k . \tag{5.21} \]


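Proof:

For $k = 1$,

\[ p \cdot H_{V,1} \cdot P_{V,1} = p \cdot P_{V,1} = [a_1 \ \cdots \ a_j \ \cdots \ a_{n_1}] \]

where

\[ \begin{aligned}
a_j &= \sum_{i \in Q_1} \Pr[X^1_0 = i] \cdot \Pr[Y_1 \in R_1,\ X^1_{t_1} = j \mid X^1_0 = i] \\
&= \sum_{i \in Q_1} \Pr[Y_1 \in R_1,\ X^1_{t_1} = j,\ X^1_0 = i] \\
&= \Pr[Y_1 \in R_1,\ X^1_{t_1} = j] .
\end{aligned} \]

Multiplied by $F_1$,

\[ \begin{aligned}
p \cdot H_{V,1} \cdot P_{V,1} \cdot F_1 &= \sum_{j \in Q_1} \Pr[Y_1 \in R_1,\ X^1_{t_1} = j] = \Pr[Y_1 \in R_1] \\
&= \Pr[Y_1 \in R_1,\ Y_2 \in Q_2, \ldots, Y_m \in Q_m] \\
&= \Pr[V] .
\end{aligned} \]

Suppose that the formula holds for $k < m$, then

\[ \begin{aligned}
p \cdot \Big[ \prod_{\ell=1}^{k+1} H_{V,\ell} \cdot P_{V,\ell} \Big] \cdot F_{k+1}
&= p \cdot \Big[ \prod_{\ell=1}^{k} H_{V,\ell} \cdot P_{V,\ell} \Big] \cdot H_{V,k+1} \cdot P_{V,k+1} \cdot F_{k+1} \\
&= A_1 \cdot H_{V,k+1} \cdot P_{V,k+1} \cdot F_{k+1}
\end{aligned} \]

where

\[ A_1 = [b_1 \ \cdots \ b_i \ \cdots \ b_{n_k}] \]

and

\[ b_i = \Pr[X^k_{t_k} = i,\ Y_k \in R_k, \ldots, Y_1 \in R_1] \]

by applying the equation for $k$.

When we iteratively compute the matrix product, beginning from
the left, the first two terms become

\[ A_2 = A_1 \cdot H_{V,k+1} = [c_1 \ \cdots \ c_j \ \cdots \ c_{n_{k+1}}] \]

where

\[ \begin{aligned}
c_j &= \sum_{i \in Q_k} b_i \cdot H_{V,k+1}(i,j) \\
&= \sum_{i \in Q_k} \Pr[X^k_{t_k} = i,\ Y_k \in R_k, \ldots, Y_1 \in R_1] \cdot \Pr[X^{k+1}_{t_k} = j \mid X^k_{t_k} = i,\ Y_k \in R_k, \ldots, Y_1 \in R_1] \\
&= \Pr[X^{k+1}_{t_k} = j,\ Y_k \in R_k, \ldots, Y_1 \in R_1] .
\end{aligned} \]

The next partial product is the result of multiplying $A_2$ by the
transition matrix $P_{V,k+1}$, which yields

\[ A_3 = A_2 \cdot P_{V,k+1} = [d_1 \ \cdots \ d_j \ \cdots \ d_{n_{k+1}}] \]

where

\[ \begin{aligned}
d_j &= \sum_{i \in Q_{k+1}} c_i \cdot P_{V,k+1}(i,j) \\
&= \sum_{i \in Q_{k+1}} \Pr[X^{k+1}_{t_k} = i,\ Y_k \in R_k, \ldots, Y_1 \in R_1] \cdot \Pr[Y_{k+1} \in R_{k+1},\ X^{k+1}_{t_{k+1}} = j \mid X^{k+1}_{t_k} = i,\ Y_k \in R_k, \ldots, Y_1 \in R_1] \\
&= \Pr[X^{k+1}_{t_{k+1}} = j,\ Y_{k+1} \in R_{k+1}, \ldots, Y_1 \in R_1] .
\end{aligned} \]

The product is completed by multiplying $A_3$ by the summing vector
$F_{k+1}$, that is,

\[ \begin{aligned}
p \cdot \Big[ \prod_{\ell=1}^{k+1} H_{V,\ell} \cdot P_{V,\ell} \Big] \cdot F_{k+1}
&= A_3 \cdot F_{k+1} \\
&= \sum_{j \in Q_{k+1}} \Pr[X^{k+1}_{t_{k+1}} = j,\ Y_{k+1} \in R_{k+1}, \ldots, Y_1 \in R_1] \\
&= \Pr[Y_{k+1} \in R_{k+1},\ Y_k \in R_k, \ldots, Y_1 \in R_1] \\
&= \Pr[Y_1 \in R_1, \ldots, Y_{k+1} \in R_{k+1},\ Y_{k+2} \in Q_{k+2}, \ldots, Y_m \in Q_m] \\
&= \Pr[R_1 \times \cdots \times R_{k+1} \times Q_{k+2} \times \cdots \times Q_m] .
\end{aligned} \]

Accordingly, the equation holds for all $k \leq m$, which completes the proof of
Theorem 5.4.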
In particular, for $k = m$, we have

Corollary:

For any Cartesian set $V = R_1 \times R_2 \times \cdots \times R_m$,

\[ \Pr[V] = p \cdot \Big[ \prod_{k=1}^{m} H_{V,k} \cdot P_{V,k} \Big] \cdot F_m . \tag{5.22} \]

Although Equation (5.22) provides us with a general formula for
computing the probability of a Cartesian set, its disadvantages derive from the
fact that the $H_{V,k}$ and $P_{V,k}$ matrices may be difficult to obtain in practical
applications. In particular, these matrices will generally depend on $V$ as well as
$X$ and, moreover, will generally depend on the history of $X$ before phase $k$.
However, the latter objections disappear when the transition probabilities are
"memoryless." More precisely, let the (unconditional) intraphase transition
matrix of the $k$th phase be the $n_k \times n_k$ matrix $\bar{P}_{V,k}$ where, for all $i, j \in Q_k$,

\[ \bar{P}_{V,k}(i,j) = \Pr[X^k_{t_k} = j,\ Y_k \in R_k \mid X^k_{t_{k-1}} = i] \tag{5.23} \]

i.e., the probability of having performance levels in $R_k$ during phase $k$ while the
intraphase process initiates in state $i$ and ends up in state $j$. Similarly, let the
(unconditional) interphase transition matrix be the $n_{k-1} \times n_k$ matrix $H_k$ where,
for all $i \in Q_{k-1}$ and $j \in Q_k$,

\[ H_k(i,j) = \Pr[X^k_{t_{k-1}} = j \mid X^{k-1}_{t_{k-1}} = i] \tag{5.24} \]

i.e., the probability that the $k$th intraphase process initiates in state $j$ given that
the $(k-1)$th intraphase process ends up in state $i$. Then the intraphase transitions
of $(X, \gamma)$ are memoryless for $V$ at phase $k$ if

\[ P_{V,k} = \bar{P}_{V,k} . \]

Similarly, the interphase transitions of $(X, \gamma)$ are memoryless for $V$ at phase $k$ if

\[ H_{V,k} = H_k . \]

Accordingly, when transitions are memoryless through phase $k$, by the
definitions and Theorem 5.4, we obtain

Theorem 5.5:

If $V = R_1 \times R_2 \times \cdots \times R_k \times Q_{k+1} \times \cdots \times Q_m$ and the intraphase and
interphase transitions of $X$ are memoryless for $V$ through phase $k$, then

\[ \Pr[V] = p \cdot \Big[ \prod_{\ell=1}^{k} H_\ell \cdot \bar{P}_{V,\ell} \Big] \cdot F_k . \tag{5.25} \]

Corollary:

For any Cartesian set $V$, if the intraphase and interphase transitions
of $X$ are memoryless for $V$ for all phases, then

\[ \Pr[V] = p \cdot \Big[ \prod_{\ell=1}^{m} H_\ell \cdot \bar{P}_{V,\ell} \Big] \cdot F_m . \tag{5.26} \]

When $V$ is a Cartesian set and $R_\ell = Q_\ell$ for $\ell = 1, 2, \ldots, k-1$, then the intraphase
and interphase transitions of $X$ are memoryless for $V$ through phase $k$.
Accordingly, applying Theorem 5.5, we obtain the following formula for the
probability of the trajectory set $V = Q_1 \times \cdots \times Q_{k-1} \times R_k \times Q_{k+1} \times \cdots \times Q_m$ which,
alternatively, is the probability of the event $Y_k \in R_k$.

Theorem 5.6:

If $V = Q_1 \times \cdots \times Q_{k-1} \times R_k \times Q_{k+1} \times \cdots \times Q_m$, then

\[ \Pr[V] = p \cdot \Big[ \prod_{\ell=1}^{k} H_\ell \cdot \bar{P}_{V,\ell} \Big] \cdot F_k . \tag{5.27} \]


By Theorems 5.5 and 5.6, when certain intraphase and interphase
transitions are memoryless for $V$, the probability of a Cartesian set $V$ is easily
obtainable. However, such results may still be difficult to use because, even
though the transitions are memoryless for $V$, they may not be memoryless for
other Cartesian sets. Accordingly, we have sought to identify stronger
conditions under which the formulas will hold for all Cartesian trajectory sets.
First, by extending previous definitions, the intraphase (interphase) transitions
of $(X, \gamma)$ are memoryless at phase $k$ if they are memoryless for all Cartesian sets
$V$ at phase $k$; the intraphase (interphase) transitions of $(X, \gamma)$ are memoryless if
they are memoryless at all phases. The advantages of memoryless transitions
are obvious, for by their definitions and the corollary to Theorem 5.4, we have

Theorem 5.7:

If $(X, \gamma)$ is a phased model and the intraphase and interphase
transitions of $X$ are memoryless, then, for all Cartesian sets $V$,

\[ \Pr[V] = p \cdot \Big[ \prod_{k=1}^{m} H_k \cdot \bar{P}_{V,k} \Big] \cdot F_m . \tag{5.28} \]
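
The product formula of Theorem 5.7 amounts to pushing a row vector through the interphase and intraphase matrices from left to right and summing the result. The following is a minimal sketch using plain nested lists; the two-state, two-phase numbers are hypothetical and serve only to show the mechanics.

```python
def vec_mat(v, m):
    """Row vector times matrix, with the matrix given as a list of rows."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def cartesian_prob(p, h_matrices, p_matrices):
    """Evaluate p . (H_1 P_1) ... (H_m P_m) . F_m for a Cartesian set V.

    h_matrices[k] and p_matrices[k] play the roles of the interphase and
    intraphase matrices of phase k+1; the final sum is the product with
    the all-ones column F_m.
    """
    v = list(p)
    for h_k, p_k in zip(h_matrices, p_matrices):
        v = vec_mat(v, h_k)
        v = vec_mat(v, p_k)
    return sum(v)


# Hypothetical example: two states, two phases, identity interphase matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
p1 = [[0.90, 0.05], [0.00, 0.80]]  # restricted to R_1, so rows may sum to < 1
p2 = [[0.70, 0.20], [0.00, 0.60]]  # restricted to R_2
prob_v = cartesian_prob([1.0, 0.0], [identity, identity], [p1, p2])
print(prob_v)
```

Working left to right keeps every intermediate result a vector, so the cost per phase is one vector-matrix product rather than a full matrix-matrix product.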

Moreover, we find that the memoryless property is relatively easy to 
characterize, that is, we are able to show the following characterization 
conditions for the memoryless property. Note that the conditions do not 
involve any specific Cartesian sets. 

Theorem 5.8:

(1) The intraphase transitions of $X$ are memoryless at phase $k$ if and only
if, for all $i, j \in Q_k$ and all $a_\ell$ in $A_\ell$ ($\ell = 1, 2, \ldots, k$),

\[ \Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i,\ Y_{k-1} = a_{k-1}, \ldots, Y_1 = a_1]
   = \Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i] . \tag{5.29} \]

(2) The interphase transitions of $X$ are memoryless at phase $k$ if and only
if, for all $i \in Q_{k-1}$, $j \in Q_k$ and all $a_\ell \in A_\ell$ ($\ell = 1, 2, \ldots, k-1$),

\[ \Pr[X^k_{t_{k-1}} = j \mid X^{k-1}_{t_{k-1}} = i,\ Y_{k-1} = a_{k-1}, \ldots, Y_1 = a_1]
   = \Pr[X^k_{t_{k-1}} = j \mid X^{k-1}_{t_{k-1}} = i] . \tag{5.30} \]

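Proof:

Suppose $P_{V,k}$ is memoryless for all Cartesian sets
$V = R_1 \times R_2 \times \cdots \times R_m$. By taking $R_\ell$ to be the singleton set $\{a_\ell\}$ ($\ell = 1, 2, \ldots, k$),

\[ \begin{aligned}
P_{V,k}(i,j) &= \Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i,\ Y_{k-1} = a_{k-1}, \ldots, Y_1 = a_1] \\
&= \Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i] \\
&= \bar{P}_{V,k}(i,j) .
\end{aligned} \]

Now, suppose that, for all $a_\ell \in A_\ell$ ($\ell = 1, 2, \ldots, k$),

\[ \Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i,\ Y_{k-1} = a_{k-1}, \ldots, Y_1 = a_1]
   = \Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i] . \]

Then, for any Cartesian set $V = R_1 \times R_2 \times \cdots \times R_m$,

\[ \begin{aligned}
P_{V,k}(i,j) &= \Pr[X^k_{t_k} = j,\ Y_k \in R_k \mid X^k_{t_{k-1}} = i,\ Y_{k-1} \in R_{k-1}, \ldots, Y_1 \in R_1] \\
&= \frac{\sum_{a_1 \in R_1, \ldots, a_k \in R_k} c_{ij}(a_1, \ldots, a_k) \cdot d_i(a_1, \ldots, a_{k-1})}
        {\Pr[X^k_{t_{k-1}} = i,\ Y_{k-1} \in R_{k-1}, \ldots, Y_1 \in R_1]}
\end{aligned} \]

where

\[ c_{ij}(a_1, \ldots, a_k) = \Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i,\ Y_{k-1} = a_{k-1}, \ldots, Y_1 = a_1] \]

and

\[ d_i(a_1, \ldots, a_{k-1}) = \Pr[X^k_{t_{k-1}} = i,\ Y_{k-1} = a_{k-1}, \ldots, Y_1 = a_1] . \]

Thus, by the assumption, $P_{V,k}(i,j)$ is equal to

\[ \frac{\sum_{a_1 \in R_1, \ldots, a_k \in R_k} \Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i] \cdot \Pr[X^k_{t_{k-1}} = i,\ Y_{k-1} = a_{k-1}, \ldots, Y_1 = a_1]}
        {\Pr[X^k_{t_{k-1}} = i,\ Y_{k-1} \in R_{k-1}, \ldots, Y_1 \in R_1]} . \]

Factoring out the term $\Pr[X^k_{t_k} = j,\ Y_k = a_k \mid X^k_{t_{k-1}} = i]$, we have

\[ \begin{aligned}
P_{V,k}(i,j) &= \Pr[X^k_{t_k} = j,\ Y_k \in R_k \mid X^k_{t_{k-1}} = i] \cdot 1 \\
&= \bar{P}_{V,k}(i,j)
\end{aligned} \]

which completes the proof for part (1) of the theorem. Part (2) is proven in a
like manner.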

5.4 Operational Models as Intraphase Processes

To illustrate the application of the above general results, we consider 
in this section performability evaluation methods assuming that the phased base 
model of a phased model (see (5.3)) is Markovian and that the intraphase
performance variables are defined as the minimum operational rates experienced
by the system during a particular phase (see (3.23)). The representation of
intraphase processes as operational models permits us to calculate the 
intraphase transition probabilities using Equations (3.45) and (3.49). By 
formulating the intraphase performance variables in terms of functionals, we 
also obtain a considerable gain in expressive power of a phased model, 
particularly in representing the effects of intraphase "repair." Moreover, since 
the operational models are allowed to vary from phase to phase, the modeling 
of a particular phase can be tailored to the computational requirements of the 
phase. In other words, this special class of phased models can be regarded as 
the time varying versions of the models introduced in Chapter 3. 


5.4.1 Derivation of Intraphase Transition Probabilities

Recall that, in order to represent the phased base model as a one-parameter
family of random variables

\[ X = \{ X_t \mid t \in T' \} , \]

the augmented utilization period is taken to be

\[ T' = T \cup \{ t_k' \mid k = 1, 2, \ldots, m-1 \} \]

where $T = [0, h]$ is the original utilization period. Hence, to describe $X$ as a
Markov process, $T'$ must be specified as a totally ordered set by inheriting the
ordering relation of $T$ and by assuming $t_k < t_k'$, $t_k' < x$, $y < t_k'$ for all
$k = 1, 2, \ldots, m-1$, and all $x, y \in T$ such that $t_k < x$ and $y < t_k$. By Theorem 5.8, the
Markov assumptions imply that the intraphase transitions and the interphase
transitions of $X$ are memoryless. Hence, applying (5.28), for all Cartesian sets
$V \subseteq A_1 \times \cdots \times A_m$,

\[ \Pr[V] = p \cdot \Big[ \prod_{k=1}^{m} H_k \cdot \bar{P}_{V,k} \Big] \cdot F_m . \tag{5.31} \]

In other words, the probability of $V$ can be expressed in terms of the initial
distribution of the first phase, $p = [p_1 \ p_2 \ \cdots \ p_{n_1}]$, the interphase transition
probabilities

\[ H_k(i,j) = \Pr[X^k_{t_{k-1}} = j \mid X^{k-1}_{t_{k-1}} = i] \]

where $k = 1, 2, \ldots, m$, $i \in Q_{k-1}$ and $j \in Q_k$, and the intraphase transition probabilities

\[ \bar{P}_{V,k}(i,j) = \Pr[X^k_{t_k} = j,\ Y_k \in R_k \mid X^k_{t_{k-1}} = i] \]

where $k = 1, 2, \ldots, m$ and $i, j \in Q_k$. Assuming $p$ and $H_k(i,j)$ can be determined from
the known properties of the system, the problem of computing $\Pr[V]$ is
reduced to that of computing the intraphase transition probabilities.

To compute the intraphase transition probabilities, we assume that
each phase of the mission is represented by an operational model. More
precisely, for each phase $k$, the intraphase process

\[ X^k = \{ X^k_t \mid t \in T_k \} \]

is a time-homogeneous Markov process and the phase $k$ performance variable
$Y_k$ is given by

\[ Y_k = \min \{ f_k(X^k_t) \mid t \in T_k \} \]

where $f_k$ is an operational structure defined on the state space of $X^k$. Clearly,
the intraphase transition probability $\bar{P}_{V,k}(i,j)$ can be determined by

\[ \Pr[X^k_{t_k} = j,\ Y_k = q \mid X^k_{t_{k-1}} = i] \quad (q \in Q_k) \]

which, in turn, can be determined by the conditional probabilities $m^q_{ij}(t)$
described in Section 3.2. More precisely, for each
$1 \leq k \leq m$ and $q \in Q_k = \{1, 2, \ldots, n_k\}$ let us define an $n_k \times n_k$ dimensional matrix

\[ M_{k,q} = [m^q_{ij}] \]

where, for all $i, j \in Q_k$,

\[ m^q_{ij} = \Pr[X^k_{t_k} = j,\ Y_k = q \mid X^k_{t_{k-1}} = i] . \]

Then, since $X^k$ is a time-homogeneous Markov process, $m^q_{ij}$ can be computed
by applying either (3.45) or (3.49). Moreover, it follows that the intraphase
transition probabilities can then be obtained from $M_{k,q}$ by matrix additions, i.e.,

\[ \bar{P}_{V,k} = \sum_{q \in R_k} M_{k,q} . \]
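
The matrix addition in the last step can be sketched as follows. The per-level matrices below are hypothetical two-state examples (not values from this chapter); they are chosen so that summing over every level gives a stochastic matrix, which is a convenient sanity check.

```python
def mat_add(a, b):
    """Entrywise sum of two matrices given as lists of rows."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def intraphase_matrix(m_by_level, levels):
    """Sum the per-level matrices M_{k,q} over the levels q in R_k."""
    levels = list(levels)
    total = [row[:] for row in m_by_level[levels[0]]]
    for q in levels[1:]:
        total = mat_add(total, m_by_level[q])
    return total


# Hypothetical phase with two states and performance levels 0, 1, 2.
m_by_level = {
    2: [[0.90, 0.00], [0.00, 0.00]],
    1: [[0.05, 0.03], [0.00, 0.90]],
    0: [[0.00, 0.02], [0.00, 0.10]],
}
p_degraded_or_better = intraphase_matrix(m_by_level, [1, 2])
p_all_levels = intraphase_matrix(m_by_level, [0, 1, 2])
print(p_degraded_or_better)
print([sum(row) for row in p_all_levels])
```

Because the events $Y_k = q$ are disjoint for distinct $q$, the sum over a set of levels is exactly the probability of ending in state $j$ with $Y_k$ in that set, which is why plain addition suffices.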
5.4.2 Evaluation of a Two-Phase Mission

In constructing a phased model that can support an evaluation of 
system performability, we must specify the intraphase processes and an 
organizing structure with respect to a specific computer and its computational 
environment. The intraphase processes together with the organizing structure 
determine an equivalent performability model that can be evaluated using 
solution methods developed in Section 5.3. 

To illustrate these modeling and evaluation methods, a 
comprehensive phased model has been examined in [49], involving the SIFT 
computer with an environment taken to be the control of a transoceanic air 
transport mission. The model represents the internal structure of the SIFT 
computer as well as conditions of its environment in terms of Markov 
processes (see [49], Figure 3 and Table m). State trajectories of the equivalent 
base model are then related to accomplishment levels of the mission via a 
capability function (i.e., an organizing structure) which is formulated in terms 
of a three-level model hierarchy (see [49], Figure 2). After the capability 
function is formulated, solution methods are then applied to determine the 
performability of the total system. 

Although the performability modeling and evaluation effort for the
SIFT computer has shown the essential aspects of the phased model method, it
has emphasized the construction of realistic higher level models. Simple
Markovian models for nonrepairable systems are used to reduce the complexity
of the performability calculation. Hence, to show the solution algorithms in
more detail, we consider in this subsection a phased model that uses
operational models as intraphase processes.

We consider a total system $S = (C, E)$ comprised of a control
computer $C$ operated in the environment $E$ of a two-phase mission. The
computer initially operates as a multiprocessor system with three identical
components. However, system reconfiguration can occur in the computer due
to phase change, hardware faults, or software faults. During the first phase
$T_1 = [t_0, t_1]$, all three subsystems are required to perform all computational tasks
successfully. But, when fewer than three subsystems are available, the system
can still survive by executing a reduced set of computational tasks. Depending
on the amount of resources available, the system exhibits the following levels
of accomplishment during phase 1:


    Phase 1 accomplishment level    Interpretation
    3                               Full performance
    2                               Noncritical performance degradation
    1                               Critical performance degradation
    0                               Failure


During the second phase $T_2 = [t_1, t_2]$, the system is reconfigured into a TMR
system with software recovery to obtain a high degree of reliability (see Section
3.4). Hence, the system exhibits the following three accomplishment levels:

    Phase 2 accomplishment level    Interpretation
    2                               Full performance
    1                               Degraded performance
    0                               Failure


To describe aspects of the system performance that the users
consider important, we assume that users are interested in distinguishing only
three levels of mission performance $A = \{2, 1, 0\}$, where the accomplishment
levels convey the following information:

    Mission accomplishment level    Interpretation
    2                               Full performance
    1                               Degraded performance
    0                               Failure


We further assume the following characteristics of the mission:

1) To achieve level 2 mission accomplishment, if performance is
degraded noncritically during the first phase, then performance
degradation is not permitted during the second phase. If phase 1
performance is not degraded, then phase 2 performance is allowed to
degrade.

2) Mission performance is degraded if phase 1 performance is noncritically
degraded and phase 2 performance is degraded. A critically degraded
performance in phase 1 will result in level 1 mission accomplishment
only if phase 2 performance is not degraded.

3) The mission accomplishment level is 0 if the system enters the failure mode
during any phase.

Under the above assumptions, the organizing structure of the phased model
(i.e., the capability function of the equivalent performability model) can be
tabulated as in Table 5.1. Clearly, the organizing structure $\Psi: A_1 \times A_2 \to A$ is
order-preserving and satisfies the conditions of Theorem 5.2.
                Accomplishment Levels

    Phase 1     Phase 2     Mission ($\Psi$)
    3           2           2
    3           1           2
    3           0           0
    2           2           2
    2           1           1
    2           0           0
    1           2           1
    1           1           0
    1           0           0
    0           2           0
    0           1           0
    0           0           0

Table 5.1

An Organizing Structure
Hence, for each mission accomplishment level $a \in A$, $\Psi^{-1}(a)$ can be expressed as
a finite union of Cartesian subsets of $A_1 \times A_2$. In particular, one such
representation is

\[ \Psi^{-1}(2) = \{3\} \times \{1,2\} \ \cup\ \{2\} \times \{2\} \tag{5.32} \]

\[ \Psi^{-1}(1) = \{2\} \times \{1\} \ \cup\ \{1\} \times \{2\} \tag{5.33} \]

\[ \Psi^{-1}(0) = \{0\} \times \{0,1,2\} \ \cup\ \{0,1\} \times \{0,1\} \ \cup\ \{0,1,2,3\} \times \{0\} \tag{5.34} \]

Note that each Cartesian set in the above equalities is an interval as defined in
Section 5.2.
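
The tabulated organizing structure and its decomposition into Cartesian sets can be cross-checked mechanically. The sketch below encodes Table 5.1 as a dictionary keyed by (phase 1 level, phase 2 level); the third term of the level 0 decomposition is taken to be {0,1,2,3} x {0}, which is what the table requires. All names are illustrative.

```python
from itertools import product

# Table 5.1: (phase 1 level, phase 2 level) -> mission level.
PSI = {
    (3, 2): 2, (3, 1): 2, (3, 0): 0,
    (2, 2): 2, (2, 1): 1, (2, 0): 0,
    (1, 2): 1, (1, 1): 0, (1, 0): 0,
    (0, 2): 0, (0, 1): 0, (0, 0): 0,
}


def leq(a, b):
    """Componentwise order on A_1 x A_2."""
    return a[0] <= b[0] and a[1] <= b[1]


# Order preservation: a <= b implies PSI[a] <= PSI[b].
order_preserving = all(
    PSI[a] <= PSI[b] for a in PSI for b in PSI if leq(a, b)
)


def cart(r1, r2):
    return set(product(r1, r2))


def preimage(level):
    return {pt for pt, value in PSI.items() if value == level}


decomposition = {
    2: cart({3}, {1, 2}) | cart({2}, {2}),
    1: cart({2}, {1}) | cart({1}, {2}),
    0: cart({0}, {0, 1, 2}) | cart({0, 1}, {0, 1}) | cart({0, 1, 2, 3}, {0}),
}
decomposition_ok = all(preimage(a) == decomposition[a] for a in (0, 1, 2))
print(order_preserving, decomposition_ok)
```

The Cartesian sets for levels 2 and 1 are pairwise disjoint, so their probabilities simply add; the sets listed for level 0 overlap, so that level is more conveniently obtained as the complement of the other two.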

To specify the equivalent base model, it is necessary that the state 
spaces of the intraphase processes be refined enough to support the evaluation 
of system performability. This condition can be satisfied if the phased base 
model (5.3) is chosen to be a time-homogeneous Markov process with a state 
space detailed enough to distinguish different operational modes of the 
intraphase processes. In other words, even though the operational models vary 
from phase to phase, they share the same underlying Markov process 
throughout the whole mission. 

Assume that the computer has the same failure characteristics as the
one considered in Section 3.4. Then, if we denote the model parameters by

    $\lambda$ = component failure rate
    $a$ = software failure rate
    $c$ = hardware error coverage
    $d$ = software error coverage
    $\mu$ = rate of software recovery,

a common time-homogeneous Markov process for both phases can be specified
as in Figure 5.2. Each state of the graph (except state 0) represents a specific
number of subsystems that are free from hardware faults; a prime (') is
appended to the number if the system is attempting recovery from a software
error. State 0 represents any other configuration. Using this Markov process,
the probabilistic nature of phase 1 and phase 2 can be represented, respectively,
by operational models induced by the operational structures as illustrated in
Table 5.2. Note that the operational rates are chosen to convey the same
information as the accomplishment levels of each phase so that the intraphase
performance variables $Y_k$ ($k = 1$ or $2$) can be defined as

\[ Y_k = \min \{ f_k(X_t) \mid t \in T_k \} \]

where $f_k$ is the operational structure of the $k$th phase.

To compute the performability of the total system, note also that
the use of a common Markov process for all phases implies that each
interphase transition matrix is an identity matrix. Accordingly, once the initial
distribution of the first phase model is known, the probabilities of Cartesian
sets can be determined by repeatedly applying (5.31). First, let us assume the
following parameter values:

    Parameter    Value
    $\lambda$    $5 \times 10^{-4}$
    $a$          $10^{-2}$
    $\mu$        $10^{3}$
    $c$          .99999
    $d$          .9

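Then, if we assume that the durations of the two phases are both 10 hours, we
have, by (5.31),

\[ M_{1,3} = \begin{bmatrix}
.89137 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} , \]

\[ M_{1,2} = \begin{bmatrix}
.08393 & .01402 & .00001 & 0 & 0 & 0 \\
0 & .89583 & 0 & 0 & 0 & 0 \\
.87777 & .01262 & .00001 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} , \]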
\[ M_{1,1} = \begin{bmatrix}
0 & .00065 & 0 & .00001 & .00005 & 0 \\
0 & .08436 & 0 & .00001 & .00005 & 0 \\
0 & .00058 & 0 & 0 & .00846 & 0 \\
0 & .88218 & 0 & .00001 & .00846 & 0 \\
0 & 0 & 0 & 0 & .90032 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} , \]


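\[ M_{2,2} = \begin{bmatrix}
.89137 & .01338 & 0 & 0 & 0 & 0 \\
0 & .89583 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} , \]

and

\[ M_{2,1} = \begin{bmatrix}
.08393 & .00065 & .00001 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
.87777 & .01262 & .00001 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} . \]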


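Accordingly, the trajectory sets in (5.32) and (5.33) can be
evaluated by applying (5.28) and (5.18), i.e.,

\[ \begin{aligned}
\Pr[\Psi^{-1}(2)] &= \Pr[\{3\} \times \{1,2\}] + \Pr[\{2\} \times \{2\}] \\
&= p \cdot M_{1,3} \cdot (M_{2,1} + M_{2,2}) \cdot F + p \cdot M_{1,2} \cdot M_{2,2} \cdot F
\end{aligned} \]

and

\[ \begin{aligned}
\Pr[\Psi^{-1}(1)] &= \Pr[\{2\} \times \{1\}] + \Pr[\{1\} \times \{2\}] \\
&= p \cdot M_{1,2} \cdot M_{2,1} \cdot F + p \cdot M_{1,1} \cdot M_{2,2} \cdot F
\end{aligned} \]

where $p$ is the initial distribution of the first phase and $F$ is a $6 \times 1$ dimensional
matrix with a "1" in every entry. In particular, if the computer is fault-free at
the beginning, i.e., $p = [1 \ 0 \ 0 \ 0 \ 0 \ 0]$, the performability of the phased mission is
as follows:

\[ \begin{aligned}
\mathrm{perf}(2) &= \Pr[\Psi^{-1}(2)] = .97036 \\
\mathrm{perf}(1) &= \Pr[\Psi^{-1}(1)] = .00769 \\
\mathrm{perf}(0) &= \Pr[\Psi^{-1}(0)] = .02195 .
\end{aligned} \]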
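
The arithmetic behind these three values can be retraced with a short script. This is a sketch under stated assumptions: the nonzero matrix entries are the rounded values read from the arrays above, the assignment of arrays to $M_{1,3}$, $M_{1,2}$, $M_{2,2}$ and $M_{2,1}$ is inferred from the formulas, and for $M_{1,1}$ only the one first-row entry that can contribute when $p = [1\ 0\ 0\ 0\ 0\ 0]$ is entered (the remaining first-row entries multiply all-zero rows of $M_{2,2}$). Helper names are illustrative.

```python
N = 6  # number of states of the common Markov process


def sparse(entries):
    """Build an N x N matrix from a {(row, col): value} dictionary."""
    m = [[0.0] * N for _ in range(N)]
    for (i, j), value in entries.items():
        m[i][j] = value
    return m


def vec_mat(v, m):
    return [sum(v[i] * m[i][j] for i in range(N)) for j in range(N)]


def mat_add(a, b):
    return [[a[i][j] + b[i][j] for j in range(N)] for i in range(N)]


def prob(p, m_phase1, m_phase2):
    """p . M_phase1 . M_phase2 . F, with F the all-ones column."""
    return sum(vec_mat(vec_mat(p, m_phase1), m_phase2))


M13 = sparse({(0, 0): .89137})
M12 = sparse({(0, 0): .08393, (0, 1): .01402, (0, 2): .00001,
              (1, 1): .89583,
              (2, 0): .87777, (2, 1): .01262, (2, 2): .00001})
M11 = sparse({(0, 1): .00065})  # only the entry relevant for this p
M22 = sparse({(0, 0): .89137, (0, 1): .01338, (1, 1): .89583})
M21 = sparse({(0, 0): .08393, (0, 1): .00065, (0, 2): .00001,
              (2, 0): .87777, (2, 1): .01262, (2, 2): .00001})

p = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # fault-free at the start
perf2 = prob(p, M13, mat_add(M21, M22)) + prob(p, M12, M22)
perf1 = prob(p, M12, M21) + prob(p, M11, M22)
perf0 = 1.0 - perf2 - perf1
print(round(perf2, 5), round(perf1, 5), round(perf0, 5))
```

Level 0 is obtained here as the complement of the other two levels rather than from (5.34), which avoids the inclusion and exclusion step for its overlapping Cartesian sets.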



CHAPTER 6 


CONCLUSION AND FURTHER RESEARCH 


The objective of this research has been to develop a general 
stochastic process model for evaluating the performability of degradable 
computing systems. This objective was established to fulfill the need of 
evaluating the unified performance and reliability of distributed multiprocessor 
systems. To accomplish the objective, a precise formulation of system 
performance is developed in a broad context and the concept is then applied to 
analyze the performance of degradable computing systems. Furthermore, a 
simple and useful user-oriented performance variable is identified and shown to 
be a proper generalization of the traditional notions of system performance and 
reliability. 

In addition to the above modeling framework, a specific two-level 
hierarchical model is developed. The model is constructed according to a 
hierarchical decomposition of a system’s behavior: Priority queueing models are 
used to analyze the system’s detailed program behavior and the results are 
combined via a Markov reward process to characterize the overall system 
performance. Although the modeling approach resembles the top-down 
structured approach of software development, the decomposition considered
here is based on a more precise classification of a system's short term and long
term equilibrium behavior. Accordingly, the modeling approach permits the
evaluation of a computing system’s hardware and software as a whole, and it 
becomes possible to deal with the performance and the reliability of a 
computing system simultaneously to measure the extent to which the user can 
benefit from tasks accomplished by the computer. 

Finally, a time-varying version of the model is considered to analyze 
the performance of phased missions. By representing intraphase models in 
terms of operational models, we are able to obtain useful results even without 
the typical no-repair assumption of the traditional phased-mission reliability
methods. Moreover, since the model considered does not require the structure 
function representation of system success, the approach thus represents an 
important generalization of traditional fault-tree analysis. 

Although the investigation efforts documented in this thesis were 
carried to the point where the research objectives described in Section 1.2 were 
satisfactorily accomplished, there remain several problems that must be 
resolved before the performability modeling techniques can become a major 
tool in the design and analysis of computing systems. 

First, to extend the usefulness of operational models, more efforts 
can be made to formulate various user-oriented performance variables that are 
suitable for a wide variety of computer applications. Since solution methods for 
these performance variables may differ considerably from those obtained in this
study, new solution techniques should also be explored.

Several generalizations of the phased model are possible depending 
on how the notion of phasing is relaxed. For example, by allowing the duration 
of each phase to vary, the model can be extended to a larger class of systems 
with time-varying environments. One may also extend the phased model by 
allowing the decision on selecting a succeeding phase to be made at the time of 
a phase change to improve the mission performance. 

Another important problem that may have significant influence on 
the modeling of degradable computing systems is the modeling of software 
faults. The problem becomes even more interesting when both software and 
hardware faults are considered simultaneously. Note that the model considered 
in Chapter 4 measures the effect of hardware faults while taking into account 
the behavior of the system software. On the other hand, software reliability 
models (see [55]-[56], for example) are typically concerned with the effect of
software faults on the system performance assuming that the hardware is fault- 
free. Clearly, useful performance measures can be obtained by combining 
performability with the results of software reliability analysis. 

Finally, we note that the hierarchical decomposition method 
considered in Chapter 4 may also be extended to the performance modeling of 
computing systems in general in addition to the performance modeling of 
degradable computing systems. By classifying the physical and logical resources
of a computing system according to their frequency of access, various models
can be constructed to facilitate the step-by-step approximation of system
performance. To make the approach useful, however, it then becomes
necessary to have a better understanding of the error bound of the
approximation.


MARKOVIAN FUNCTIONALS OF
MARKOV PROCESSES


This appendix is reference material for Chapter 3. Conditions under
which operational models become Markovian are stated in terms of slightly
modified forms of what can be found in the literature. The use of these
conditions, together with their limitations, is illustrated through examples.

When modeling computing systems as operational models, there are 
many situations in which the state space of the underlying base model may be 
much larger than needed to distinguish operational rates via the operational 
structure. Accordingly, to simplify the evaluation of system performability, one 
question that arises naturally is whether the operational models can be 
described as Markov processes and, if so, whether they are time-homogeneous. 
More precisely, let us suppose the total system is modeled by a
time-homogeneous Markov process

    $$ X = \{X_s \mid 0 \le s \le t\} $$

with a denumerable state space Q and, relative to an operational structure
$f: Q \to R$, the operational model is the functional

    $$ Z = \{f(X_s) \mid 0 \le s \le t\}. $$

Since the above question does not involve the actual values of f, Z may be
regarded here as a "lumped" [52] version of X, where states i and j are in the
same lump if and only if f(i) = f(j). In other words, lumps coincide with the
operational modes of S. If f is 1-1 then Z is obviously both Markovian and
time-homogeneous since, in this case, the lumping is trivial. If f is properly a
many-to-one function, the answer is no longer obvious, and indeed the
question needs further clarification.

To begin, let us suppose that the underlying process X is specified 
by its generator matrix A and an initial distribution p. Then our original 
inquiry can be reduced to the following questions: 

Q1) Given A, p and f, is Z Markovian?

In many applications, however, one wants the freedom to alter the initial 
distribution p without losing the Markov property. In this case we are asking: 

Q2) Given A and f, is Z Markovian for arbitrary p? 

Finally, we can raise our sights even higher and ask: 

Q3) Given A and f, is Z Markovian for arbitrary p and,
moreover, is the transition function of Z
independent of p?

Adopting the terminology of [52] (which investigates the discrete-time versions
of Q1 and Q3), if the answer to Q1 is "yes" then the process X specified by A
and p is weakly lumpable (with respect to f). A "yes" answer to Q2 is stronger
but, generally, these Markov processes will not be time-homogeneous. If the
answer to Q3 is "yes" then, for all initial distributions p, the Markov processes
Z have the same transition function and, by the homogeneity of X, it follows
that this function is invariant under time shifts, i.e., the processes are
time-homogeneous. In this case we say that the processes X specified by A are
strongly lumpable (with respect to f).

Addressing first the question of weak lumpability (Q1), if Z is to be
a Markov process, we must ensure it has the "memoryless" property, that is, any
sequence of past observations of Z provides the same information as the last of
those observations. To formalize this requirement, if $0 \le t_1 < t_2 < \cdots < t_k$ is a
sequence of observation times and $q_i \in f(Q)$ is the state of Z observed at time
$t_i$, for each underlying state $j \in Q$, let

    $$ M_j(t_1, \ldots, t_k; q_1, \ldots, q_k) = \Pr[X_{t_k} = j \mid Z_{t_1} = q_1, \ldots, Z_{t_k} = q_k] \tag{A.1} $$


then the $1 \times |Q|$ matrix

    $$ M(t_1, \ldots, t_k; q_1, \ldots, q_k) = [M_j(\,;\,)] $$

is the probability distribution of the states of X at time $t_k$, as conditioned by
these observations of Z. In particular, since $Z_{t_k} = f(X_{t_k})$, it follows that $M_j(\,;\,)$ is
nonzero only if $f(j) = q_k$, so $\{M_j(\,;\,) \mid f(j) = q_k\}$ gives the probability
distribution of states inside the lump $f^{-1}(q_k)$. Moreover, since X is a Markov
process, $M(\,;\,)$ permits us to represent conditional probabilities in terms of the
transition function P of X, i.e., it can be shown that, for all
$0 \le t_1 < t_2 < \cdots < t_k = t$ and $s \ge 0$,

    $$ M(t_1, \ldots, t_k; q_1, \ldots, q_k)\, P(s) = [\pi_1 \;\; \pi_2 \;\; \cdots] \tag{A.2} $$

where, for each $i \in Q$,

    $$ \pi_i = \Pr[X_{t+s} = i \mid Z_{t_1} = q_1, \ldots, Z_{t_k} = q_k]. $$


To translate distributions of X back up into distributions of Z
relative to some specified ordering of the lumps, let $Q_j$ denote the $j$th lump,
i.e., the collection of sets

    $$ \{Q_j \mid 1 \le j \le |f(Q)|\} $$

is the partition of Q induced by f. Accordingly, if $\pi = [\pi_1 \;\; \pi_2 \;\; \cdots]$ is a
probability distribution over the states of X, we let $\bar{\pi}$ denote the corresponding
distribution over the lumps, that is

    $$ \bar{\pi} = [\bar{\pi}_1 \;\; \bar{\pi}_2 \;\; \cdots] \tag{A.3} $$

where

    $$ \bar{\pi}_j = \sum_{i \in Q_j} \pi_i \qquad (1 \le j \le |f(Q)|). $$

In terms of the above notation, weak lumpability can then be characterized
similarly to its discrete-time analog.

Theorem A.1: 

Let X be a time-homogeneous Markov process with transition 
function P and a fixed initial state probability distribution p. Let Z be a 
functional of X and, for all t>0, define 

    $$ A_p(t) = \{M(t_1, \ldots, t_k; q_1, \ldots, q_k) \mid 0 \le t_1 < \cdots < t_k = t \text{ and } q_1, \ldots, q_k \in f(Q)\}. $$

Then, Z is a Markov process if and only if, for all $s \ge 0$ and for all $\pi, \pi' \in A_p(t)$,

    $$ \bar{\pi} = \bar{\pi}' \quad \text{implies} \quad \overline{\pi P(s)} = \overline{\pi' P(s)}, \tag{A.4} $$

where the bar denotes the distribution over lumps defined in (A.3).

Theorem A.1 can be proved in a way similar to that of its discrete-time
analog (see [52], pp. 132-134). To illustrate its application, let us suppose
that the system in question is a multicomputer comprised of three identical
computer modules. Suppose further that modules fail independently and that
each fails permanently with a constant failure rate $\lambda$. Then we can take the
base model X to be the Markov process depicted by the state-transition-rate
diagram of Figure A.1.



Figure A.1

Markov Model of a Multicomputer 

As for operational rates, let us assume they are normalized so that, at full
capacity the rate is 1, and with the loss of one or two modules the rate is 1/2;
loss of a third module results in total failure. Accordingly, the operational
structure here is the function

      i      f(i)
      1      1
      2      1/2
      3      1/2
      4      0

and hence the functional Z takes values in the state set f(Q) = {1, 1/2, 0}. On taking
the inverse of f, these states correspond as follows to lumps of Q:

      1    ↔  {1}
      1/2  ↔  {2,3}
      0    ↔  {4}.

If we now examine the probabilistic nature of Z, we find that the conditional
probability $\Pr[Z_{t+s} = 0 \mid Z_t = 1/2]$ depends on the time that Z enters state 1/2 from
state 1 if the latter event is possible (i.e., if the probability of initially being in
state 1 is nonzero). Thus, for example, if X is initially in state 1 with
probability 1, i.e., p = [1 0 0 0] is the initial state probability distribution, then
we have such a dependence (on the past history of Z) and therefore Z is not a
Markov process. On the other hand, let us suppose the initial distribution is
p = [0 0 1 0], which is not a likely choice from a functional point of view, but it
serves to illustrate the role of p. In this case $A_p(t)$ (as defined in the statement
of Theorem A.1) is the same for any time t in T, i.e., it is the set

    $$ A_p(t) = \{[0\;0\;1\;0],\ [0\;0\;0\;1]\}. $$

Accordingly, the conditions of Theorem A.1 are vacuously satisfied, and
therefore Z is a Markov process for this choice of p. Moreover, it should be
obvious that Z, in this case, is time-homogeneous. Other distributions, such as
p = [0 1 0 0], can be shown to result in Markov processes that are not
time-homogeneous.
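
The dependence on past history for p = [1 0 0 0] can be checked numerically. The following Python sketch is illustrative only and is not part of the original analysis. It assumes a failure rate of 1, arbitrary observation times 0.5 and 1.0, and a look-ahead of 0.5; all function and variable names are hypothetical. It relies on the fact that, with independent failures at rate λ, the number of surviving modules after an interval is binomially distributed, and it compares the probability of reaching state 0 for two observation histories of Z that both end in state 1/2.

```python
import math

LAM = 1.0  # assumed failure rate (illustrative value only)
# Lumps of Q induced by f; base states 1..4 are stored as indices 0..3.
LUMPS = {1.0: [0], 0.5: [1, 2], 0.0: [3]}

def trans(i, j, s, lam=LAM):
    """Transition probability of the base model over an interval of length s.
    State index i has 3 - i working modules, each surviving with prob. exp(-lam*s)."""
    n, m = 3 - i, 3 - j
    if m > n:
        return 0.0  # failed modules are never repaired
    surv = math.exp(-lam * s)
    return math.comb(n, m) * surv**m * (1.0 - surv)**(n - m)

def step(dist, s):
    """Advance a distribution over base states by time s."""
    return [sum(dist[i] * trans(i, j, s) for i in range(4)) for j in range(4)]

def mask(dist, q):
    """Keep only the probability mass inside the lump observed as q."""
    return [x if j in LUMPS[q] else 0.0 for j, x in enumerate(dist)]

def prob_next(p, times, obs, s, q_next):
    """Pr[Z at (last time + s) = q_next | Z observed as obs at the given times]."""
    dist, prev = list(p), 0.0
    for t, q in zip(times, obs):
        dist = mask(step(dist, t - prev), q)
        prev = t
    total = sum(dist)
    after = step([x / total for x in dist], s)
    return sum(after[j] for j in LUMPS[q_next])

p = [1.0, 0.0, 0.0, 0.0]
# Same last observation (state 1/2 at time 1.0), different earlier observation.
via_full = prob_next(p, [0.5, 1.0], [1.0, 0.5], 0.5, 0.0)
via_half = prob_next(p, [0.5, 1.0], [0.5, 0.5], 0.5, 0.0)
print(via_full, via_half)
```

Under these assumptions the two printed values differ, so the last observation of Z alone does not determine the future of Z, which is the failure of the memoryless property described above.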

Regarding the second question (Q2), a necessary and sufficient
condition can be obtained by extending the previous theorem to arbitrary initial
distributions. More precisely, it can be shown that Z is a Markov process
whatever the initial distribution if and only if, for all $t, s \ge 0$ and any initial
distribution p, condition (A.4) holds for all $\pi, \pi'$ in $A_p(t)$. Although the
characterization is useful from a conceptual point of view, it is difficult to use in
practical applications. A more desirable form of this result can be found in [53]
(pp. 1113-1114, Theorem 4), which assumes that X has a finite state space.
(The theorem was generalized later in [54] to allow for an arbitrary state space.)
Stating the desired form of this result in terms of the notation defined above,
we have:

Theorem A.2: 

Let X be a time-homogeneous Markov process with generator
matrix $A = [a_{ij}]$ and let Z be a functional of X determined by f. Then Z is a
Markov process, whatever the initial distribution of X, if and only if for each
$q \in f(Q)$ taken separately either

(i) For all $i, j \in Q$ such that $f(i) \ne q$ and $f(j) = q$,

        $$ a_{ij} = 0 $$

or

(ii) For all $r \in f(Q)$ such that $r \ne q$, the sum

        $$ \sum_{f(j) = r} a_{ij} $$

is the same for all $i \in Q$ such that $f(i) = q$.

Although the conditions of Theorem A.2 guarantee that Z is a
Markov process relative to any initial distribution p for X, note that the specific 
nature of Z (as specified by its transition function) will generally depend on p. 
Moreover, the process Z need not be time-homogeneous. 
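
Conditions (i) and (ii) of Theorem A.2 lend themselves to a mechanical test. The Python sketch below is offered only as an illustration under stated assumptions: the function names are hypothetical, the failure rate is set to 1, and the generator is the one implied by the three-module model of Figure A.1 (failure rates 3λ, 2λ and λ out of states 1, 2 and 3). It is applied to the operational structure with rates 1, 1/2, 1/2, 0 considered earlier and to the TMR structure considered next.

```python
from fractions import Fraction

def lumps_of(f):
    """Group state indices by their value under f (the partition induced by f)."""
    lumps = {}
    for i, q in enumerate(f):
        lumps.setdefault(q, []).append(i)
    return lumps

def condition_i(A, f, q):
    """(i): no transition rate into lump q from any state outside it."""
    n = len(f)
    return all(A[i][j] == 0
               for i in range(n) for j in range(n)
               if f[i] != q and f[j] == q)

def condition_ii(A, f, q):
    """(ii): for every other lump r, the summed rate into r is the same
    from each state of lump q."""
    lumps = lumps_of(f)
    for r, targets in lumps.items():
        if r == q:
            continue
        sums = {sum(A[i][j] for j in targets) for i in lumps[q]}
        if len(sums) > 1:
            return False
    return True

def markov_for_any_initial(A, f):
    """Each state q of Z must satisfy (i) or (ii), taken separately."""
    return all(condition_i(A, f, q) or condition_ii(A, f, q) for q in set(f))

lam = 1  # assumed rate; integer arithmetic keeps the comparisons exact
A = [[-3 * lam, 3 * lam, 0, 0],
     [0, -2 * lam, 2 * lam, 0],
     [0, 0, -lam, lam],
     [0, 0, 0, 0]]
f_degraded = [1, Fraction(1, 2), Fraction(1, 2), 0]  # rates 1, 1/2, 1/2, 0
f_tmr = [1, 1, 0, 0]                                 # triplication with voting

print(markov_for_any_initial(A, f_degraded))
print(markov_for_any_initial(A, f_tmr))
```

For the first structure the lump {2,3} fails both conditions, which agrees with the earlier observation that Z is not Markovian for p = [1 0 0 0]. For the TMR structure every lump satisfies one of the two conditions.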

To illustrate Theorem A.2 and the above observations, let us again
consider the Markov process X having the state-transition-rate diagram given
by Figure A.1. Then, the generator matrix of X is the 4×4 matrix

    $$ A = \begin{bmatrix}
    -3\lambda & 3\lambda & 0 & 0 \\
    0 & -2\lambda & 2\lambda & 0 \\
    0 & 0 & -\lambda & \lambda \\
    0 & 0 & 0 & 0
    \end{bmatrix}. $$



Suppose, however, that the operational structure here is one that corresponds
to triplication with voting (TMR), i.e., the function

      i      f(i)
      1      1
      2      1
      3      0
      4      0


Then f(Q) = {1, 0} and, applying Theorem A.2, we see that state 1 (i.e., lump {1,2})
satisfies condition (i) and state 0 (i.e., lump {3,4}) satisfies condition (ii).
Hence, the functional Z is a Markov process. To determine the probabilistic
nature of Z, let us rename state 0 (in f(Q)) as state 2 (permitting the use of
standard matrix notation) and let P(s,t) denote the transition function of Z,
i.e.,

    $$ P(s,t) = [p_{ij}(s,t)] \qquad (s \le t) $$

where

    $$ p_{ij}(s,t) = \Pr[Z_t = j \mid Z_s = i]. $$

Then, relative to an initial distribution $p = [p_1 \; p_2 \; p_3 \; p_4]$ for X, if we let

    $$ d = \frac{p_1}{p_1 + p_2}, $$

it can be shown that the matrix P(s,t) has the following entries:

    $$ p_{11}(s,t) = \frac{e^{-2\lambda(t-s)}\bigl(1 + 2d(1 - e^{-\lambda t})\bigr)}{1 + 2d(1 - e^{-\lambda s})}, $$

    $$ p_{12}(s,t) = 1 - p_{11}(s,t), $$

    $$ p_{21}(s,t) = 0, \tag{A.5} $$

    $$ p_{22}(s,t) = 1. $$
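
The entry p_11(s,t) in (A.5) can be cross-checked against the base model. The Python sketch below is illustrative only: the function names, the rate λ = 1, the initial distribution [0.7 0.3 0 0] and the time points are all assumptions, and the closed form coded here is the expression for p_11(s,t) as we read it from (A.5). Since failed modules are never repaired, Z being in state 1 at time t implies it was in state 1 at time s, so p_11(s,t) is the ratio of the probabilities that X lies in lump {1,2} at times t and s.

```python
import math

def base_transition(i, j, s, lam):
    """Transition probability of X over an interval s; state i (1..4) has 4 - i
    working modules, each surviving the interval with probability exp(-lam*s)."""
    n, m = 4 - i, 4 - j
    if m > n:
        return 0.0  # no repair
    surv = math.exp(-lam * s)
    return math.comb(n, m) * surv**m * (1.0 - surv)**(n - m)

def p11_from_base(p, s, t, lam):
    """Pr[Z_t = 1 | Z_s = 1] as Pr[X_t in {1,2}] / Pr[X_s in {1,2}]."""
    def prob_up(u):
        return sum(p[i - 1] * base_transition(i, j, u, lam)
                   for i in (1, 2, 3, 4) for j in (1, 2))
    return prob_up(t) / prob_up(s)

def p11_closed_form(p, s, t, lam):
    """Closed-form p_11(s,t) with d = p_1 / (p_1 + p_2)."""
    d = p[0] / (p[0] + p[1])
    num = math.exp(-2 * lam * (t - s)) * (1 + 2 * d * (1 - math.exp(-lam * t)))
    return num / (1 + 2 * d * (1 - math.exp(-lam * s)))

lam = 1.0                   # assumed failure rate
p = [0.7, 0.3, 0.0, 0.0]    # assumed initial distribution, so d = 0.7
print(p11_from_base(p, 0.5, 1.5, lam), p11_closed_form(p, 0.5, 1.5, lam))
# Same time difference, different starting time: the values differ when d > 0.
print(p11_closed_form(p, 0.5, 1.5, lam), p11_closed_form(p, 1.5, 2.5, lam))
```

The first line printed compares the two computations for one pair (s,t); the second compares two intervals of equal length, which illustrates the dependence of P(s,t) on more than the difference t-s whenever d is nonzero.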


From the above equations, we see that the transition function P(s,t)
depends on d and, hence, on the initial distribution $p = [p_1 \; p_2 \; p_3 \; p_4]$. Moreover,
we observe that Z is time-homogeneous (i.e., the values of P(s,t) depend only
on the time difference t-s) only when d = 0. In other words, by the definition
of d, Z is time-homogeneous if and only if $p_1 = 0$, i.e., there is a zero probability
that the underlying process X is initially in state 1 (all three modules
fault-free). However, with our interpretation of Z as a TMR model, this special case
is pathological, and hence for most practical purposes Z will not be
time-homogeneous.

Finally, turning to the question of strong lumpability (see Q3
above), the answer can be characterized by removing condition (i) of Theorem
A.2 and modifying the proof to accommodate this change. More precisely, we
have

Theorem A.3: 

Let X be a time-homogeneous Markov process with generator
matrix $A = [a_{ij}]$ and let Z be a functional of X determined by f. Then Z is a
Markov process, whatever the initial distribution p of X and with a transition
function that is independent of p, if and only if for each $q \in f(Q)$ the following is
satisfied:

For all $r \in f(Q)$ such that $r \ne q$, the sum

    $$ \sum_{f(j) = r} a_{ij} \tag{A.6} $$

is the same for all $i \in Q$ such that $f(i) = q$.

To illustrate Theorem A.3, suppose X
is specified by the generator matrix

    $$ A = \begin{bmatrix}
    -3\lambda & \lambda & \lambda & \lambda & 0 & 0 & 0 \\
    0 & -2\lambda & 0 & 0 & \lambda & \lambda & 0 \\
    0 & 0 & -2\lambda & 0 & \lambda & 0 & \lambda \\
    0 & 0 & 0 & -2\lambda & 0 & \lambda & \lambda \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 0
    \end{bmatrix} \tag{A.7} $$

and f is the function

      q      f(q)
      1      1
      2      2
      3      3
      4      3                                             (A.8)
      5      4
      6      4
      7      4


Testing condition (A.6) for states 1 and 2 in f(Q), we see that it holds trivially
since these states correspond to singleton lumps. As for state 3 ∈ f(Q)
(corresponding to lump {3,4}), with respect to states 1, 2 ∈ f(Q) the sums are zero
for both i = 3 and i = 4; with respect to state 4 ∈ f(Q) the sum is 2λ for both i = 3
and i = 4. Thus condition (A.6) holds for state 3. Finally, (A.6) is likewise
satisfied for state 4 ∈ f(Q) and we conclude that Z is a Markov process with a
transition function that is independent of p.
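
The test of condition (A.6) carried out by hand above can be automated. The following Python sketch is a minimal illustration with hypothetical names and an assumed rate λ = 1. It checks the seven-state example and, for contrast, the four-state model of Figure A.1 under the TMR structure, where lump {1,2} has unequal summed rates into lump {3,4}.

```python
def strongly_lumpable(A, f):
    """Check condition (A.6): for each ordered pair of distinct lumps (q, r),
    the summed rate from state i into lump r is the same for every i in lump q."""
    lumps = {}
    for i, q in enumerate(f):
        lumps.setdefault(q, []).append(i)
    for q, sources in lumps.items():
        for r, targets in lumps.items():
            if r == q:
                continue
            sums = {sum(A[i][j] for j in targets) for i in sources}
            if len(sums) > 1:
                return False
    return True

lam = 1  # assumed rate; integers keep the equality tests exact

# Seven-state example: generator (A.7) and its operational structure.
A7 = [[-3 * lam, lam, lam, lam, 0, 0, 0],
      [0, -2 * lam, 0, 0, lam, lam, 0],
      [0, 0, -2 * lam, 0, lam, 0, lam],
      [0, 0, 0, -2 * lam, 0, lam, lam],
      [0, 0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0, 0]]
f7 = [1, 2, 3, 3, 4, 4, 4]

# Four-state model of Figure A.1 with the TMR structure.
A4 = [[-3 * lam, 3 * lam, 0, 0],
      [0, -2 * lam, 2 * lam, 0],
      [0, 0, -lam, lam],
      [0, 0, 0, 0]]
f_tmr = [1, 1, 0, 0]

print(strongly_lumpable(A7, f7))
print(strongly_lumpable(A4, f_tmr))
```

The second call reports that the TMR functional is not strongly lumpable, which is consistent with its transition function depending on the initial distribution.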

In general, if X is strongly lumpable (as characterized by Theorem
A.3) it is easily shown that Z must inherit the time-homogeneity of X. In
other words, a strongly lumped process will always be time-homogeneous
and, accordingly, it can be specified by a constant generator matrix. More
precisely, if we rename the states in f(Q) (if not already so named) with the
integers from 1 to |f(Q)|, the generator matrix $\hat{A} = [\hat{a}_{qr}]$ of Z can be constructed
directly from A, where entry $\hat{a}_{qr}$ ($q \ne r$) is given by the invariant sum of
condition (A.6) for any i such that f(i) = q. (The diagonal entries $\hat{a}_{qq}$ are then
determined by the condition that rows must sum to zero.) Thus, for the
example just considered (see (A.7) and (A.8)), the generator matrix of Z is the
4×4 matrix

    $$ \hat{A} = \begin{bmatrix}
    -3\lambda & \lambda & 2\lambda & 0 \\
    0 & -2\lambda & 0 & 2\lambda \\
    0 & 0 & -2\lambda & 2\lambda \\
    0 & 0 & 0 & 0
    \end{bmatrix}. $$
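
The construction of the lumped generator just described can be sketched in a few lines of Python. This is an illustration only: the function name and the `order` argument are hypothetical, λ is set to 1, and the routine assumes that condition (A.6) has already been verified, so that any state of a lump may serve as its representative.

```python
def lumped_generator(A, f, order):
    """Build the generator of Z from A, assuming condition (A.6) holds.
    `order` lists the values of f in the state order wanted for Z."""
    members = {q: [i for i, v in enumerate(f) if v == q] for q in order}
    G = []
    for q in order:
        i = members[q][0]  # any representative of lump q gives the same sums
        row = [sum(A[i][j] for j in members[r]) if r != q else 0 for r in order]
        row[order.index(q)] = -sum(row)  # rows of a generator sum to zero
        G.append(row)
    return G

lam = 1  # assumed rate
A7 = [[-3 * lam, lam, lam, lam, 0, 0, 0],
      [0, -2 * lam, 0, 0, lam, lam, 0],
      [0, 0, -2 * lam, 0, lam, 0, lam],
      [0, 0, 0, -2 * lam, 0, lam, lam],
      [0, 0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0, 0]]
f7 = [1, 2, 3, 3, 4, 4, 4]

for row in lumped_generator(A7, f7, order=[1, 2, 3, 4]):
    print(row)
```

With λ = 1 the rows produced agree with the 4×4 matrix given above.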


As illustrated through the above examples, a strongly lumped 
process is clearly the most desirable type of operational model. On the other 
hand, by Theorem A.3, it is evident that such models require a relatively 
restricted "match" between the probabilistic nature of the base model (as 
specified by A) and the operational structure f. 

The conditions of Theorem A.2 are somewhat weaker although,
when satisfied, the transition rates of the resulting Markov functional are
generally time-varying and dependent on the initial distribution of the
underlying process. Of significance here is that even without strong lumpability
one can obtain operational models that are Markovian and admit feasible,
closed-form analytic solutions (see Equations (A.5), for example). What must be
used here, then, are solution techniques for arbitrary (discrete-state) Markov
processes, as opposed to more special (and much more familiar) techniques
that apply only to time-homogeneous Markov processes.

Finally, regarding weak lumpability (Theorem A.1), the requirement
here is even less restrictive. However, depending on A, p, and f, it may be
difficult to decide whether the condition of Theorem A.1 is satisfied.
Moreover, we currently know of no general means of solving such models
without resorting to detailed computations at the base model level. The utility
of weak lumpability is also curtailed by the fact that the initial state distribution
of the base model is fixed. This may be satisfactory in certain applications but
one often wishes to examine the influence of different initial distributions. In
such cases, one must derive a solution for each of the given distributions
provided, of course, that each admits weak lumpability.

Theorems A.1-A.3 thus provide formal support for what we and
others in the field have observed through experience: at higher, more
user-oriented levels of abstraction, it is difficult to maintain a Markovian
representation of system behavior. As a consequence, we should seek means
for accommodating operational models (functionals) that are not Markovian.
The latter task is less formidable than it might appear if we bear in mind that,
when evaluating a system S, an operational model Z plays an intermediate role
in support of a specific performance variable Y. Thus, our knowledge of Z can
be restricted to that required to solve for the probability distribution of Y, i.e., the
performability of S. The latter observation serves as the guiding principle for
the work described in Chapter 3.